1.0  Executive  Summary


1.1  Overview 

The  purpose  of  this  program  was  to  explore  algorithms  for  operational 
speaker  identification  systems.  Algorithms  developed  in  the  laboratory  have  not 
been  robust  to  the  short,  noisy  transmissions  generally  found  in  operational 
communications. For these reasons, we evaluated emerging methods
and  reevaluated  some  basic  assumptions  in  speaker  identification  technology. 
The  most  important  notion  we  challenged  was  that  high  performance  speaker 
identification  was  not  possible  with  very  short  training  utterances  of 
approximately  five  seconds. 

Our  approach  was  to  compare  traditional  speech  parameters  with  new 
parameters  obtained  from  auditory-like  models.  Recent  reported  performance 
improvements  in  automatic  speech  recognition  systems  using  auditory-like 
models  motivated  the  approach  [1]  [2].  The  results  of  these  recent  research 
efforts  suggest  that  auditory-like  models  lessen  the  detrimental  effects  of  noise 
and non-stationary channels. In addition to the parameter comparative analysis,
we compared a well known clustering classifier, the LBG Vector
Quantizer,  with  the  backpropagation  and  recurrent  backpropagation  neural 
network  classifiers.  Primarily,  in  the  classifier  study,  we  wanted  to  determine 
whether  the  recurrent  network  could  learn  a  speaker's  unique  spatio-temporal 
patterns  for  enhanced  speaker  identification  performance. 

We  used  two  speech  databases  to  evaluate  the  developed  speaker 
identification  algorithms.  We  used  the  KING  database,  commonly  used  in 
speaker  identification  research,  to  baseline  algorithm  performance.  The  second 
database we used, the GREENFLAG database, comprises recordings of off-the-air
transmissions of US Air Force pilots in Green Flag exercises at Nellis Air
Force  Base.  We  used  the  GREENFLAG  database  to  test  the  developed  algorithms 
in  as  close  to  an  operational  environment  as  possible. 

1.2  Summary  of  Results  and  Their  Significance 

One  of  our  main  findings  is  that  there  is  no  single  speech  parameter  that 
consistently  captures  the  discriminating  features  of  a  speaker's  voice.  The 
channel  characteristics  of  the  communications  medium  dictate  what  parameters 
and  parameter  normalization  techniques  should  be  used  for  optimal
discrimination. 

Once  we  established  the  best  set  of  parameters  for  a  given  channel  condition, 
we  used  a  simple  but  effective  parameter  fusion  method  that  increased  overall 
performance.  The  combination  of  finding  the  appropriate  set  of  parameters  for 
the  given  channel,  and  then  fusing  those  parameters  during  classification  was  the 
key  to  high  performance  speaker  identification. 

We  also  found  that  the  simple  vector  quantizer  outperformed  the  neural 
network  classifiers  we  studied.  Another  drawback  to  the  backpropagation 
neural  networks  is  the  amount  of  training  time  they  require  for  achieving  good 
performance.  The  neural  networks  we  investigated  took  days  to  train.  The  vector 
quantizer,  on  the  other  hand,  required  only  tens  of  minutes  to  train. 

In  the  baseline  experiments  we  were  able  to  match  or  exceed  published 
results.  With  a  commonly  used  26  speaker  subset  of  the  KING  database,  we  were 
able  to  attain  100%  recognition.  Most  importantly,  we  achieved  nearly  96% 
recognition  of  41  pilots  using  a  total  of  159  test  transmissions  in  GREENFLAG 
database  experiments.  We  obtained  this  performance  with  an  average  training 
transmission  length  of  4.2  seconds  and  an  average  testing  transmission  length  of 
2.4  seconds. 

These  results  are  very  encouraging.  We  believe  that  high  performance  tactical 
and  strategic  operational  speaker  identification  systems  can  be  developed  with 
the  available  algorithms  and  commercially  available  computational  platforms. 
Portable  field  units  can  be  deployed  in  less  than  three  years. 
2.0  Introduction


2.1  Organization  of  the  Report

This  report  is  written  in  three  main  parts  and  in  three  levels.  The  first  part  is 
a high level description of the effort and is found in the Executive Summary in
Section  1.  The  second  part  is  comprised  of  Sections  2-5  and  describes  the 
research  effort  in  more  detail.  We  omitted  mathematical  details  in  the  second 
part  to  ensure  clear  conceptual  discussions  free  from  mathematical  language. 
Mathematical  descriptions,  however,  are  necessary  for  those  interested  in  the 
details  of  all  algorithms  developed  or  used  in  this  effort.  These  mathematical 
details are found in Appendices 1-6, which comprise the third part of the report.

In  Sections  2-5  we  emphasize  the  details  and  methods  we  investigated  that 
should  be  considered  in  an  operational  development  program  or  that 
consistently  produced  the  best  results.  We  omitted,  for  example,  results  obtained 
from  the  neural  network  classifiers.  Rome  Laboratory  personnel  de-emphasized 
the  neural  network  approaches  and  directed  us  not  to  pursue  these  approaches 
further.  This  does  not  mean,  however,  that  neural  networks  should  be  excluded 
from  further  speaker  identification  research.  For  a  detailed  description  of  all  our 
work,  refer  to  the  four  status  reports  compiled  in  Appendix  8. 

In  Section  3  we  describe  our  approach  to  the  research  in  more  detail.  We 
describe  the  speech  parameters  investigated,  the  VQ  classifier  used,  and  the 
method  for  fusing  partial  results  from  multiple  classifiers.  We  also  discuss  the 
open  set  issue  in  detail  and  describe  in  conceptual  terms  the  out-of-set  metrics 
used  in  the  effort.  We  provide  the  results  and  their  significance  in  Section  4. 
Section  5  is  the  conclusion  section  where  we  provide  the  lessons  learned,  discuss 
the  major  outstanding  issues,  and  provide  several  recommendations  for  the  next 
step. The references are found in Section 6. Selected result printouts are found
in  Appendix  7.  Appendix  8  is  a  compilation  of  the  four  status  reports  delivered 
under  this  effort  and  Appendix  9  is  the  software  user's  manual. 

2.2  Purpose  of  Research 

The  purpose  of  this  research  was  to  explore  algorithms  for  operational 
speaker identification systems. The operational communications environment is
very different from the laboratory environments in which speaker identification
technology was developed. The operational communications environment is
characterized by short (typically less than 10 seconds), noisy transmissions and
exhibits a great deal of channel variability. Until very recently, the laboratory
produced wide bandwidth recordings of speech with high signal-to-noise ratios,
which were typically tens of seconds or even minutes long for each speaker. Mainly
due  to  the  disparity  between  the  two  environments,  speaker  identification 
systems  have  not  performed  well  in  the  operational  environment. 

2.3  Statement  of  Work  Review 

Given  the  motivation  for  the  research,  this  section  briefly  reviews  the  key 
tasks  in  the  Statement  of  Work  (SOW).  The  first  task  was  to  evaluate  the 
auditory-like  model  of  hearing,  called  the  Perceptual  Linear  Prediction  of  speech, 
and  the  traditional  cepstrum  (and  its  variations).  The  second  task  was  to 
evaluate  the  recurrent  backpropagation  network,  at  least  one  other  neural 
network classifier, and the Gaussian mixture classifier. The performance of the
speaker identification algorithms developed from Tasks 1 and 2 was to be tested
using  the  KING  database  in  Task  3. 

As work proceeded, an operational database became available: the GREENFLAG
database developed by Rome Laboratory. Since the KING database has
been  widely  used  in  speaker  identification  research,  baseline  performance  tests 
using  the  KING  database  (for  apples-to-apples  comparisons)  were  still  necessary. 
However,  with  the  GREENFLAG  database,  we  had  the  opportunity  to  test  all 
algorithms  with  real  US  Air  Force  communications.  Therefore,  Rome  Laboratory 
personnel  directed  us  to  test  our  algorithms  with  the  new  database. 

In  addition  to  this  minor  change,  we  omitted  a  full  evaluation  of  the  different 
neural  network  classifiers.  Early  in  the  research  we  attained  very  high 
recognition  rates  in  KING  database  experiments  using  the  well  known  vector 
quantization  (VQ)  classifier.  The  VQ  classifier  also  produced  very  good  results 
in  GREENFLAG  database  experiments.  Initial  results  using  the  recurrent 
backpropagation  network  and  the  standard  backpropagation  network  were  not 
as encouraging. For these reasons, and per Rome Laboratory's
instructions,  a  full  evaluation  of  the  neural  networks  listed  in  Task  2  of  the  SOW 
was  not  conducted. 
3.0  Approach


3.1  Speaker  Recognition  Algorithm 

In  this  section  we  provide  a  high  level  description  of  the  individual  parts  of 
the  speaker  recognition  algorithm.  The  algorithm  is  comprised  of  two  main 
systems:  the  training  system  and  the  performance  system.  Each  system  uses  the 
same  signal  preprocessor  and  parameter  extraction  algorithms.  The  block 
diagram in Figure 1 illustrates the algorithm at the highest level.

In  the  preprocessing  stage,  the  sampled  communications  transmission  is 
segmented  with  a  window  of  a  predetermined  type  and  length.  These 
windowed  segments  are  typically  referred  to  as  frames.  The  frames  may  be 
overlapped  by  a  predetermined  amount.  In  all  our  experiments  we  used  a  256 
sample  Hamming  window  with  50%  overlap.  With  8  kHz  sampled  data,  this 
amounts to a 32 msec frame that advances in steps of 16 msec over the sampled
input  signal. 
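
As an illustration only, the framing step described above can be sketched in a few lines of Python. The function names `hamming` and `frame_signal` are hypothetical and are not taken from the software delivered under this effort; dropping the trailing partial frame is also an assumption.

```python
import math

def hamming(n_samples):
    """Return an n_samples-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_signal(samples, frame_len=256, overlap=0.5):
    """Split samples into Hamming-windowed frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    hop = int(frame_len * (1.0 - overlap))   # 128 samples for 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames

SAMPLE_RATE_HZ = 8000
frames = frame_signal([1.0] * SAMPLE_RATE_HZ)   # one second of dummy data
frame_ms = 1000.0 * 256 / SAMPLE_RATE_HZ        # frame duration in msec
hop_ms = 1000.0 * 128 / SAMPLE_RATE_HZ          # frame advance in msec
```

With 8 kHz data the two derived values are 32 msec and 16 msec, matching the figures quoted in the text.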


Figure  1.  Block  Diagram  of  Speaker  Identification  Algorithm  in  Training  Mode 


After  the  signal  is  segmented,  the  next  step  in  the  preprocessing  stage  is  to 
determine  whether  the  frame  contains  speech  (specifically,  voiced  speech)  or 
background noise. If the speech/non-speech detection sub-system detects
speech in the frame, it passes the frame on to the parameter extraction stage;
otherwise it discards the frame. Figure 2 shows the block diagram of the
preprocessing  stage. 
[Block diagram, PREPROCESSING STAGE: sampled transmission -> WINDOWING -> SPEECH/NON-SPEECH DETECTION -> to PARAMETER EXTRACTION stage]

Figure  2.  Block  Diagram  of  the  Preprocessing  Stage 


Each  windowed  frame  containing  speech  is  processed  in  the  Parameter 
Extraction stage. After the parameters are extracted from the speech frames, they
are  either  passed  to  the  training  sub-system  or  the  testing  sub-system  depending 
on  the  mode  of  operation. 

3.2  Speech  Parameters 

We  describe  all  parameters  produced  by  the  Parameter  Extraction  stage  in 
this  section.  We  investigated  two  main  groups  of  parameters.  The  first  group  is 
based  on  the  Linear  Predictive  Coefficient  (LPC)  cepstrum.  The  LPC  cepstrum  is 
widely  used  in  speech  processing  because  it  is  easy  to  compute  and  because  it 
performs  reliably  in  speech  processing  applications  [3].  The  second  group  of 
parameters  is  based  on  the  Perceptual  Linear  Prediction  (PLP)  cepstrum  [1].  By 
themselves,  the  LPC  and  PLP  cepstra  are  not  robust  in  varying  channel 
conditions.  Therefore  we  chose  two  parameter  normalization  procedures  and 
two  cepstral  extensions  that  have  improved  both  speaker  identification  and 
speech recognition performance. These include RASTA filtering, liftering, the
delta  cepstrum,  and  the  acceleration  cepstrum  [2]  [4]. 

The  block  diagram  in  Figure  3  illustrates  the  operations  of  the  Parameter 
Extraction  stage.  We  describe  each  of  the  parameters  and  the  normalization 
techniques  in  more  detail  in  the  following  paragraphs. 
[Block diagram, PARAMETER EXTRACTION STAGE: speech frames -> LPC PARAMETERS / PLP PARAMETERS -> PARAMETER NORMALIZATION -> to TRAINING or TESTING SYSTEM]

Figure  3.  Block  diagram  of  the  Parameter  Extraction  Stage 


3.2.1  Linear  Prediction  Coefficients.  The  current  sample  of  a  discrete  time  signal 
can  be  estimated  as  the  weighted  sum  of  a  number  of  previous  samples  of  the 
signal. If one finds the values of the weights in that sum that minimize the
error between the value of the current sample and the estimate, then one has
coefficients that can predict the current sample; hence the name Linear Predictive
Coefficients. There are two main reasons why the LPC characterization of speech
has been ubiquitous in speech processing for over 20 years. The first reason is
that the LPCs compactly describe the spectrum of the signal. The second, and
equally important, reason is that the LPCs are computed quite easily. The LPCs
are computed by solving a system of linear equations using the lagged
autocorrelation values of the signal. The LPC algorithm and its proof are found in
Appendix  1. 
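
The following Python sketch illustrates one common way of solving these equations: the autocorrelation method with the Levinson-Durbin recursion. The choice of Levinson-Durbin is an assumption made for illustration, since the solver actually used is described in Appendix 1, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Lagged autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor weights.

    The prediction is x_hat[n] = sum_k a[k] * x[n - k] for k = 1..order.
    Returns (a, error); a[0] is unused and error is the residual energy.
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:          # degenerate (e.g. all-zero) frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error           # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        error *= (1.0 - k * k)
    return a, error

# A decaying exponential is predicted by a single weight close to 0.9.
frame = [0.9 ** n for n in range(200)]
a, err = levinson_durbin(autocorrelation(frame, 2), 2)
```

For this test signal the first weight comes out close to 0.9 and the second close to zero, as expected for a first-order process.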

Notwithstanding  the  LPC's  popularity,  the  cepstrum  has  become  the  speech 
representation  of  choice  in  both  speaker  and  speech  recognition  applications  (see 
May  24,  1993  Status  Report,  pg.  10).  Even  so,  the  LPC  continues  to  play  a 
dominant  role  as  explained  in  the  next  section. 

3.2.2  LPC  Cepstrum.  In  this  section,  we  summarize  some  important  historical 
aspects  of  the  cepstrum  and  how  it  is  computed  from  the  LPCs.  For  a  more 
detailed  treatment  of  the  cepstrum  see  [5]  and  [6].  The  cepstrum  was  developed 
in  the  early  60s  as  a  tool  for  separating  or  analyzing  convolved  signals  with 
different  and  unknown  harmonic  contents.  In  the  case  of  speech,  the  cepstrum 
provided  a  method  for  separating  the  rapidly  varying  glottal  information  from 
the vocal tract filter, or vice versa. This is accomplished by taking the Fourier
transform of the log magnitude of the Fourier transform of the signal. In the resulting
cepstrum,  high  frequency  harmonics  are  visible  at  the  low  end  of  the  abscissa,  or 
quefrency  axis.  On  the  other  hand,  low  frequency  harmonics  are  seen  as  high 
quefrency  spikes. 

The  cepstrum  provided  a  new  tool  for  applications  such  as  pitch  detection, 
glottal  pulse  analysis,  and  formant  analysis.  Pitch  analysis  is  straightforward  in 
the  cepstrum.  A  dominant  spike  is  observed  in  the  quefrency  region  associated 
with  the  human's  pitch  range.  Since  the  quefrency  is  in  units  of  time,  the  inverse 
of quefrency describes the fundamental frequency of the harmonic. For glottal
pulse and formant analysis, simple windowing techniques (or liftering) and the
inverse cepstrum are all that is required.

In  the  early  70s  the  cepstrum  was  derived  from  the  LPCs  [7].  Because  this 
method  of  computing  the  cepstrum  is  more  efficient  than  the  Fourier  Transform 
approach,  the  LPC  derived  cepstrum  became  more  popular.  Because  of  its 
importance,  we  include  the  derivation  of  the  cepstrum  from  the  LPCs  in 
Appendix  1.  The  LPC  cepstrum's  popularity  is  still  strong  today  despite  the  fact 
that  Davis  and  Mermelstein  showed  [3]  that  the  Fourier  cepstrum  (especially  the 
mel  cepstrum)  outperformed  the  LPC  cepstrum  in  continuous  speech 
recognition.  Davis  and  Mermelstein  found  that  the  LPC  cepstrum  produced 
more errors than the Fourier cepstrum in the unvoiced areas of speech. This is
not a surprising result since it is well known that the LPCs have difficulty
modeling unvoiced speech. Since we used a voiced/unvoiced detector in most
of  our  experiments,  we  assumed  that  the  LPC  cepstrum  would  perform  as  well 
as  the  Fourier  cepstrum.  The  Davis  and  Mermelstein  article  supports  our 
assumption. 
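
For readers who want a concrete picture of the LPC-to-cepstrum conversion, the sketch below uses the standard recursion for an all-pole model. It assumes the predictor convention x_hat[n] = sum_k a[k] x[n-k] (sign conventions differ between texts), omits the gain term c[0], and uses a hypothetical function name; the derivation used in this effort is the one in Appendix 1.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the model 1 / (1 - sum_k a[k] z^-k).

    a[k] is the weight on x[n - k]; a[0] is ignored.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

# For a single pole at 0.5 the cepstrum is known in closed form: 0.5**n / n.
ceps = lpc_to_cepstrum([0.0, 0.5], 4)
```

The single-pole case is a convenient check because its cepstrum has the closed form a^n / n.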

3.2.3  Perceptual  Linear  Prediction.  The  LPC  cepstrum  is  the  speech 
representation  of  choice  in  contemporary  speech  processing  applications. 
Recently,  however,  the  PLP  cepstrum  has  gained  some  popularity.  In  the  search 
for  more  robust  representations  of  speech,  some  in  the  community  are  studying 
auditory  models  of  speech.  Some  of  these  auditory  models  are  very  complicated 
and  require  significant  computational  resources.  For  example,  some  of  these 
models  include  fluid  and  basilar  membrane  dynamics  as  well  as  hair  cell 
activation  described  as  nonlinear  differential  equations.  Hermansky  [1]  took  a 
different  approach  and  developed  a  modeling  technique  guided  by  some
important  but  loosely  followed  psychophysiological  constraints  together  with 
well  understood  and  efficient  signal  processing  techniques. 

The  PLP  modeling  process  begins  with  a  straightforward  Fourier  filter  bank 
representation  of  speech.  The  filter  banks,  however,  are  carefully  designed  to 
meet  certain  auditory  psychophysiological  criteria.  From  the  filter  bank 
representation,  autocorrelation-like  coefficients  are  computed  using  the  inverse 
Cosine  transform.  These  autocorrelation-like  coefficients  are  then  used  in  the 
LPC  routine  for  computing  Perceptual  Linear  Predictive  coefficients.  With  these 
PLP  coefficients,  the  PLP  cepstrum  is  then  computed  as  described  in  Section 
3.2.2,  and  in  greater  detail  in  Appendix  1.  We  describe  the  PLP  process  in  detail 
in  Appendix  2. 

In  the  next  section,  we  describe  how  to  normalize  and  extend  the  LPC  and 
PLP  cepstra  in  order  to  make  the  speech  parameters  more  robust  in  changing 
channel  conditions. 

3.3  Speech  Parameter  Normalization 

We  used  two  parameter  normalization  techniques  and  two  cepstral 
extensions  that  are  commonly  used  in  contemporary  speech  processing  research. 
The two normalization techniques are cepstrum liftering and RelAtive SpecTrAl
(RASTA)  filtering  [2].  The  two  cepstral  extensions  are  the  delta  cepstrum  (the 
first  derivative  of  the  cepstrum)  and  the  acceleration  cepstrum  (the  second 
derivative  of  the  cepstrum).  Figure  4  illustrates  the  Parameter  Normalization 
stage  of  our  speaker  identification  algorithm.  The  numbered  arrows  at  the 
bottom  of  Figure  4  correspond  to  the  combination  of  normalization  techniques 
performed  on  the  LPC  or  PLP  cepstra  in  the  Parameter  Normalization  Stage. 
Table  1  lists  these  combinations.  Note  that  the  DERIVATIVE  block  in  Figure  4 
has  two  feedback  lines  from  its  output  with  a  number  one  within  the  feedback 
loop.  These  feedback  loops  signify  that  the  delta  cepstrum  is  fed  back  to  the 
derivative  process  one  time  to  produce  the  acceleration  cepstrum.  In  the 
following subsections we describe the RASTA, liftering, and derivative
processes. 

3.3.1  Liftering.  It  is  clear  that  the  names  cepstrum,  quefrency,  and  liftering  are 
clever  word  plays  on  their  Fourier  analogs  spectrum,  frequency,  and  filtering. 
Functionally, the two groups are indeed analogs of one another. To filter a signal
one multiplies the signal and filter together in the spectral domain. Similarly, to
lifter a signal one multiplies the signal and lifter in the cepstral domain. Liftering
therefore is a particular weighting of the cepstrum. Liftering can be used for two
different purposes. In our introduction of the cepstrum, we suggested that we
could use liftering for separating different parts of the speech waveform; for
example, the glottal information from the vocal tract filter. This is the classic use of
liftering. Liftering is also used as a means of computing a weighted distance. In
this study we use the Euclidean metric; therefore, liftering is used to compute the
weighted Euclidean distance between the target and test vectors. We
investigated several lifters and found that the Hamming window performed the
best. For a more in-depth discussion on the weighted Euclidean metric for the
cepstrum see [6] pp. 377-379.


Figure 4. Block diagram of the Parameter Normalization Stage


Parameter Combinations

Number   Parameter
1        cepstrum
2        liftered cepstrum
3        liftered cepstrum with RASTA filtering
4        cepstrum with RASTA filtering
5        delta cepstrum with RASTA filtering
6        acceleration cepstrum with RASTA filtering
7        acceleration cepstrum
8        delta cepstrum

Table 1
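
The use of liftering as a weighted Euclidean distance, described in Section 3.3.1, can be sketched as follows. The exact shape of the Hamming lifter is not given in this part of the report, so the sketch assumes a Hamming window spanning the coefficient indices; the function names are hypothetical.

```python
import math

def hamming_lifter(n_ceps):
    """Hamming-shaped weights, one per cepstral coefficient (assumed form)."""
    if n_ceps == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_ceps - 1))
            for k in range(n_ceps)]

def liftered_distance(c_test, c_target, lifter):
    """Euclidean distance between two cepstral vectors after liftering both."""
    return math.sqrt(sum((w * (x - y)) ** 2
                         for w, x, y in zip(lifter, c_test, c_target)))

# With unit weights the result reduces to the ordinary Euclidean distance.
plain = liftered_distance([3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
weighted = liftered_distance([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], hamming_lifter(3))
```

Multiplying both vectors by the lifter before taking the difference is equivalent to weighting each squared coefficient difference by the square of the lifter value.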

3.3.2  RASTA  Filtering.  Simply  put,  the  RASTA  operation  is  the  bandpass 
filtering  of  each  cepstral  channel  (or  coefficient)  over  time.  Hermansky 
introduced RASTA to remove the slowly varying processes induced by the
channel.  We  included  the  filter  details  in  Appendix  3. 
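
As a rough illustration of filtering each cepstral channel over time, the sketch below applies a causal band-pass filter to every coefficient trajectory. The coefficients are those commonly quoted for the RASTA filter in the open literature and are an assumption here; the filter actually used in this effort is the one given in Appendix 3. The function names are hypothetical.

```python
def rasta_filter(track, pole=0.98):
    """Band-pass filter one cepstral coefficient's trajectory over frames.

    Implements y[t] = pole*y[t-1] + 0.1*(2x[t] + x[t-1] - x[t-3] - 2x[t-4])
    with zero initial conditions (assumed coefficients, see lead-in).
    """
    numerator = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    for t in range(len(track)):
        acc = pole * out[t - 1] if t > 0 else 0.0
        for k, b in enumerate(numerator):
            if t - k >= 0:
                acc += b * track[t - k]
        out.append(acc)
    return out

def rasta_all_channels(cepstra):
    """Apply the filter to every coefficient index of a list of cepstral vectors."""
    n_ceps = len(cepstra[0])
    tracks = [rasta_filter([frame[k] for frame in cepstra]) for k in range(n_ceps)]
    return [[tracks[k][t] for k in range(n_ceps)] for t in range(len(cepstra))]

# A constant offset, such as a fixed channel effect, decays toward zero.
constant_response = rasta_filter([1.0] * 1000)
```

Because the numerator coefficients sum to zero, a constant component in a cepstral channel is suppressed, which is the behavior the text attributes to RASTA for slowly varying channel effects.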

3.3.3  Delta  and  Acceleration  Cepstrum.  The  common  method  of  finding  the 
derivative  of  a  sequence  at  a  point  is  through  an  N  point  linear  regression. 
Therefore,  to  find  the  delta  cepstrum,  we  used  an  N  point  linear  regression  over 
each  cepstral  channel.  For  the  acceleration  cepstrum,  we  repeated  the  process  on 
the delta cepstrum. We experimented with N = 3 through N = 7. We included
the  linear  regression  equation  in  Appendix  1. 
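
A minimal sketch of the regression-based derivative follows. It assumes an odd N centered on the current frame and repeats the first and last frames at the edges; both choices are assumptions for illustration, and the function names are hypothetical. The regression equation used in this effort is in Appendix 1.

```python
def delta(cepstra, n_points=5):
    """Least-squares slope of each cepstral channel over n_points frames."""
    half = n_points // 2
    denom = sum(k * k for k in range(-half, half + 1))
    last = len(cepstra) - 1
    out = []
    for t in range(len(cepstra)):
        row = []
        for ch in range(len(cepstra[0])):
            num = 0.0
            for k in range(-half, half + 1):
                idx = min(max(t + k, 0), last)   # repeat edge frames (assumption)
                num += k * cepstra[idx][ch]
            row.append(num / denom)
        out.append(row)
    return out

def acceleration(cepstra, n_points=5):
    """Second derivative: the delta process applied to the delta cepstrum."""
    return delta(delta(cepstra, n_points), n_points)

# A linear ramp has a constant slope and zero acceleration away from the edges.
ramp = [[2.0 * t, -1.0 * t] for t in range(20)]
d = delta(ramp)
acc = acceleration(ramp)
```

Applying `delta` twice mirrors the single feedback pass through the DERIVATIVE block described for Figure 4.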

After  the  Parameter  Normalization  Stage  produces  all  parameters,  they  are 
sent  to  either  the  Training  System  or  Testing  System  depending  on  the  speaker 
identification algorithm's mode of operation. The training and testing sub-systems
are described in terms of the Linde-Buzo-Gray Algorithm in Section 3.4.

3.4  Classifier 

The  classifier's  training  procedure  defines  the  training  system.  Similarly,  the 
testing  system  is  defined  by  how  the  classifier  is  used  during  performance  and 
how  the  output  scores  the  classifier  produces  are  accumulated  to  provide  an 
answer.  In  this  section  we  explain  how  we  used  a  vector  quantization  clustering
algorithm  for  our  training  and  testing  systems. 

We used a vector quantization clustering algorithm called the Linde-Buzo-Gray
(LBG), or K-means, algorithm for training the VQ classifier. This algorithm
is also known as the Generalized Lloyd algorithm in the data compression
literature. In short, the LBG algorithm represents the training vectors by K distinct
vectors or codewords chosen to minimize the total average error between
the training data and the codewords. We used the mean squared error criterion,
which is the same as the normalized, squared Euclidean distance. We describe the
LBG algorithm in more detail in Appendix 4.
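
The clustering idea can be illustrated with a bare-bones generalized Lloyd iteration. This is a sketch under stated assumptions, not the procedure of Appendix 4: the initial codewords are simply taken at even spacing from the training data (LBG implementations often grow the codebook by splitting instead), the iteration count is fixed, and the function names are hypothetical.

```python
def squared_error(u, v):
    """Mean squared error between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)

def train_codebook(vectors, k, n_iter=20):
    """Return k codewords for the training vectors (requires k <= len(vectors))."""
    step = len(vectors) / k
    codebook = [list(vectors[int(i * step)]) for i in range(k)]
    for _ in range(n_iter):
        # Assign every training vector to its nearest codeword.
        cells = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: squared_error(v, codebook[j]))
            cells[nearest].append(v)
        # Move each codeword to the centroid of its cell.
        for j, cell in enumerate(cells):
            if cell:   # an empty cell keeps its previous codeword
                dim = len(cell[0])
                codebook[j] = [sum(v[d] for v in cell) / len(cell)
                               for d in range(dim)]
    return codebook

training = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
            [10.0, 10.0], [10.2, 10.0], [10.0, 10.2]]
codebook = train_codebook(training, 2)
```

On this toy data the two codewords settle on the centroids of the two well-separated groups.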

The  LBG  algorithm  comprises  our  Training  System.  The  training  vectors 
correspond  to  all  the  cepstral  parameters  (of  a  single  type)  obtained  from  the 
training  signal's  speech  frames.  Thus,  each  speaker's  training  signal  is  passed 
through  the  Preprocessing  Stage,  through  the  Parameter  Extraction  Stage,  and 
finally  through  the  Training  System.  The  Training  System  produces  a  codebook 
(a  collection)  of  K  codewords  for  each  parameter  class,  for  each  speaker.  For 
example,  for  a  three  parameter  speaker  identification  system  for  20  speakers,  a 
total  of  60  codebooks  are  generated  during  training.  We  found  that  40  codeword 
codebooks  produced  the  best  results  in  both  GREENFLAG  and  KING  database 
experiments. 

To  simplify  our  description  of  the  testing  procedure,  let  us  concentrate  our 
discussion  on  a  single  codebook  -  for  some  parameter  type  -  from  some  speaker. 
Like  the  training  signal,  the  test  signal  is  decomposed  into  a  sequence  of  vectors 
after  it  passes  through  the  Preprocessing  and  Parameter  Extraction  stages.  This 
entire  sequence  of  vectors  is  passed  through  the  codebook  and  the  average  of  all 
the  lowest  mean  squared  error  (MSE)  scores  is  computed.  This  procedure  is 
repeated  for  all  parameter  types  and  for  all  speakers.  In  the  next  section  we 
describe  how  to  combine  the  average  lowest  MSE  scores  produced  by  each 
individual  parameter  codebook,  from  each  speaker,  to  arrive  at  the  final  decision. 

3.5  Parameter  Fusion 

In  Section  3.3  we  described  eight  individual  parameters  we  compute  for  each 
frame  of  speech.  Therefore,  as  per  Section  3.4,  there  are  eight  possible  codebooks 
trained  for  each  speaker.  Similarly  eight  different  codebook  scores  are  generated 
during  a  test.  The  question  is  how  to  combine  these  scores  meaningfully  to
acquire  better  results  than  can  be  attained  by  any  individual  parameter.  We  use 
a  very  simple  technique.  For  each  parameter  class,  we  find  the  ratio  between 
each  speaker's  score  and  the  winning  speaker's  score.  After  all  ratios  are 
computed  for  each  parameter  class,  we  simply  add  each  speaker's  ratios.  The 
speaker  with  the  lowest  ratio  sum  is  chosen  as  the  target  speaker. 

A  simple  example  will  illustrate  the  parameter  fusion  or  adjudication  process. 
Suppose there are three speakers (s1, s2, and s3) in the set. A codebook based on
the cepstrum and a codebook based on the delta cepstrum are trained for each
speaker. A test signal for s2 is decoded by the two codebooks and produces the
results  shown  in  Table  2.  These  are  typical  scores  encountered  in  real  tests. 


Codebook Adjudication Example

             Cepstrum           Delta Cepstrum
Speaker      MSE      ratio     MSE       ratio
s1           .0504    1.00      .00053    1.23
s2           .0509    1.01      .00043    1.00
s3           .0723    1.43      .00081    1.88

Table 2


The ratio sums for s1 through s3 are 2.23, 2.01 and 3.31 respectively. In this
example,  s2  is  chosen  as  the  target  speaker,  which  is  the  correct  speaker.  Simply 
adding  the  MSE  scores  directly  would  not  have  produced  the  correct  result.  The 
MSE  ratio  gives  a  relative  measure  of  closeness  within  a  given  parameter  space 
and  thus  provides  an  intuitive  vehicle  for  combining  disparate  parameter 
codebook  scores. 
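
The ratio-sum rule can be written out in a few lines. The sketch below reuses the MSE values of Table 2; the function name and the dictionary layout are hypothetical.

```python
def fuse_scores(scores):
    """scores maps parameter name -> {speaker: MSE score}.

    Returns (winning speaker, {speaker: ratio sum}).
    """
    ratio_sums = {}
    for per_speaker in scores.values():
        best = min(per_speaker.values())   # winning score for this parameter
        for speaker, score in per_speaker.items():
            ratio_sums[speaker] = ratio_sums.get(speaker, 0.0) + score / best
    winner = min(ratio_sums, key=ratio_sums.get)
    return winner, ratio_sums

table2 = {
    "cepstrum": {"s1": 0.0504, "s2": 0.0509, "s3": 0.0723},
    "delta cepstrum": {"s1": 0.00053, "s2": 0.00043, "s3": 0.00081},
}
winner, sums = fuse_scores(table2)

# Adding the raw MSE scores instead lets the larger-valued parameter dominate.
raw_sums = {s: table2["cepstrum"][s] + table2["delta cepstrum"][s]
            for s in ("s1", "s2", "s3")}
raw_winner = min(raw_sums, key=raw_sums.get)
```

Because the ratios here are not rounded before summing, the sums can differ from the values in the text in the last digit. The raw-score comparison shows why direct addition fails: the cepstrum scores are roughly one hundred times larger than the delta cepstrum scores and therefore decide the outcome alone.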

In our tests we found that, in most cases, multi-parameter fusion produced better
results  than  could  be  attained  through  any  single  parameter.  The  ratio  method  is 
not  the  only  method  of  combining  individual  classifier  results. 
3.6  Significance  of  Open  Set  Classification  in  Operational  Speaker  ID

The  speaker  identification  problem  is  a  classification  problem  where  the 
classes  are  defined  as  the  individual  speakers  of  some  predefined  group  or  set. 
Closed  set  speaker  identification  involves  only  speakers  from  the  given  set  of 
previously  trained  speakers.  In  other  words,  the  test  speaker  is  known  to  be  one 
from  the  set.  In  this  case  the  classifier  has  only  to  find  the  speaker  model  that 
best  matches  the  test  speaker.  The  classifier  performance  is  based  on  the 
percentage  of  correct  identifications. 

In  operational  speaker  identification,  however,  one  cannot  guarantee  that  the 
test  speaker  is  in  the  set  of  trained  speakers.  This  is  an  open  set  classification 
problem.  The  classifier  must  first  determine  whether  the  test  speaker  is  one  in 
the  desired  set  before  it  tries  to  match  speakers.  Equivalently,  the  classifier  could 
first  match  the  test  speaker  to  a  potential  candidate  in  the  trained  set,  and  then 
determine  whether  it  really  is  the  target  speaker.  We  used  the  latter  approach  in 
our  investigations.  Classifier  performance  in  this  case  is  based  on  three 
parameters:  percent  detection  (how  many  test  speakers  from  the  desired  set  were 
classified  as  in-set  speakers),  percent  correct  given  detection  (how  many  of  the 
test  speakers  correctly  classified  as  in-set  speakers  were  correctly  identified),  and 
percent  false  alarms  (how  many  test  speakers  not  in  the  desired  set  were 
incorrectly  identified  as  in-set  speakers). 
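
The three open set performance figures can be computed from a list of trial outcomes as sketched below. The trial record layout and the function name are hypothetical, chosen only to make the definitions concrete.

```python
def open_set_metrics(trials):
    """trials: list of (true_speaker, accepted, identified_speaker).

    true_speaker is None when the test speaker is not in the trained set.
    Returns (percent detection, percent correct given detection,
    percent false alarms).
    """
    in_set = [t for t in trials if t[0] is not None]
    out_of_set = [t for t in trials if t[0] is None]
    detected = [t for t in in_set if t[1]]
    correct = [t for t in detected if t[2] == t[0]]
    false_alarms = [t for t in out_of_set if t[1]]

    def percent(part, whole):
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return (percent(detected, in_set),
            percent(correct, detected),
            percent(false_alarms, out_of_set))

trials = [("a", True, "a"), ("b", True, "a"), ("c", False, "c"),
          ("d", True, "d"), (None, True, "a"), (None, False, "b")]
detection, correct_given_detection, false_alarm = open_set_metrics(trials)
```

In this made-up example, three of four in-set speakers are detected, two of those three are identified correctly, and one of two out-of-set speakers is falsely accepted.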

Open  set  classification  is  more  difficult  because  boundaries  (or  score 
thresholds)  must  be  determined  for  each  speaker  in  the  trained  set.  In  the  next 
section  we  describe  the  metrics  we  used  to  determine  those  boundaries. 

3.7  Out-of-Set  Metrics 

We  investigated  four  out-of-set  metrics  in  this  effort.  These  include  global 
MSE  score  thresholds,  individual  thresholds  based  on  the  average  of  the  N 
closest  matching  speakers'  scores,  cohort  normalized  thresholds,  and  thresholds 
based  on  the  probability  of  observing  a  candidate  speaker's  score.  We  describe 
each  of  these  in  the  following  subsections.  The  mathematical  details  are  found  in 
Appendix  6. 

3.7.1  Global  MSE  Score  Thresholds.  Just  as  the  name  implies,  with  this  method, 
a  global  MSE  score  threshold  T  is  determined  for  all  speakers.  If  the  candidate 
speaker's  MSE  score  is  greater  than  T  the  candidate  is  rejected;  otherwise,  the 
candidate  is  accepted  as  the  identified  speaker.  This  is  the  simplest  method
because  it  requires  no  additional  computational  steps  to  determine  the  threshold. 
However,  one  would  not  expect  that  all  speakers  have  the  same  optimal 
threshold.  If  this  is  true  we  need  to  find  a  method  of  getting  the  best  threshold 
for  each  individual.  The  following  metrics  attempt  to  accomplish  this. 
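
A minimal sketch of the global threshold rule follows. We assume here that the candidate is the speaker whose codebook yields the lowest MSE score; the function name, the dictionary layout, and the numeric values are hypothetical.

```python
def identify_with_global_threshold(scores, threshold):
    """Return the best-matching speaker, or None if the match is rejected.

    scores maps each trained speaker to the MSE score of the test
    transmission against that speaker's codebook (lower is better).
    """
    candidate = min(scores, key=scores.get)
    # Reject when the candidate's MSE score is greater than the threshold T.
    if scores[candidate] > threshold:
        return None
    return candidate


scores = {"speaker_1": 0.8, "speaker_2": 0.5}
accepted = identify_with_global_threshold(scores, threshold=0.6)
rejected = identify_with_global_threshold(scores, threshold=0.4)
```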

3.7.2  N  Closest  Speakers’  Scores  Thresholds.  With  this  metric  each  speaker's 
threshold  is  derived  from  scores  produced  by  testing  all  other  speakers'  training 
signals against the given speaker's codebook. The threshold is set to the average
of the N smallest MSE scores produced. The motivation here is to find the speakers
with  the  acoustic  characteristics  that  best  match  the  speaker  in  question,  and  to 
set  a  threshold  based  on  the  scores  produced  by  their  speech.  The  following 
method  uses  a  variation  on  this  theme. 
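
The threshold computation can be sketched as follows. The nested dictionary of cross scores and all names are assumptions for illustration; the exact formulation used in this effort is the one given in Appendix 6.

```python
def n_closest_threshold(cross_scores, speaker, n):
    """Average of the N smallest MSE scores that other speakers' training
    signals produce against `speaker`'s codebook.

    cross_scores[codebook_owner][signal_owner] holds the MSE score of
    signal_owner's training signal tested against codebook_owner's codebook.
    """
    others = [score for owner, score in cross_scores[speaker].items()
              if owner != speaker]
    closest = sorted(others)[:n]
    return sum(closest) / len(closest)


# Hypothetical scores against speaker "a"'s codebook.
cross_scores = {"a": {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}}
threshold_a = n_closest_threshold(cross_scores, "a", n=2)
```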

3.7.3  Cohort  Normalized  Thresholds.  There  is  another  method  of  finding  the 
speakers  that  are  acoustically  closest  to  a  target  speaker.  Take  a  given  speaker's 
training signal and test it against all other speakers' models (or codebooks). The
speaker models that produce the smallest scores are the ones acoustically closest
to the speaker in question. The closest speakers in this case are known as cohorts
[8]. A candidate speaker's score can be normalized using the scores of that
speaker's cohorts. It is cumbersome to describe this method without mathematics;
therefore,  we  describe  the  details  in  Appendix  6.  Suffice  it  to  say  that  if  this 
normalized  score  is  less  than  some  threshold  C,  then  the  candidate  is  rejected; 
otherwise  the  candidate  is  accepted  as  the  speaker. 
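
The cohort selection step (not the normalization itself, which is defined in Appendix 6) can be sketched as below. The function name, the cohort size k, and the scores are hypothetical.

```python
def select_cohorts(signal_scores, speaker, k):
    """Return the k speakers whose models best match `speaker`'s training signal.

    signal_scores maps each model owner to the MSE score obtained when
    `speaker`'s training signal is tested against that owner's model.
    """
    others = {owner: score for owner, score in signal_scores.items()
              if owner != speaker}
    # Smallest scores first: these models are acoustically closest.
    return sorted(others, key=others.get)[:k]


signal_scores = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.6}
cohorts_of_a = select_cohorts(signal_scores, "a", k=2)
```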

3.7.4  Probability  Thresholds.  With  this  method,  we  want  to  find  the  probability 
that  the  scores  previously  produced  by  the  candidate  are  greater  than  the  current 
candidate's score. If that probability is less than some threshold P, then the
candidate is rejected; otherwise, the candidate is accepted as the identified
speaker.  To  find  the  probability  in  question,  one  produces  a  cumulative 
distribution  of  scores  obtained  by  testing  the  speaker's  training  signal  against  the 
speaker's  codebook.  The  next  step  is  to  subtract  that  distribution  from  one. 
Upon  testing,  the  value  of  the  resulting  cumulative  distribution  at  the  candidate's 
score  is  the  probability  we  are  looking  for.  The  training  signal  must  be  broken  up 
into  smaller  segments  in  order  to  acquire  enough  scores  to  make  a  probability 
density.  Subsequent  testing  must  be  done  at  the  same  temporal  scale  as  was 
used  to  determine  the  probability  densities. 
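
A minimal sketch of the probability test follows, using an empirical cumulative distribution built from the segment scores. The function names, the segment scores, and the value of P are assumptions for illustration only.

```python
from bisect import bisect_right


def exceedance_probability(training_scores, candidate_score):
    """One minus the empirical cumulative distribution at candidate_score,
    i.e. the fraction of training-segment scores greater than the score."""
    ordered = sorted(training_scores)
    n_not_greater = bisect_right(ordered, candidate_score)
    return 1.0 - n_not_greater / len(ordered)


def accept_by_probability(training_scores, candidate_score, p_threshold):
    # Reject when the probability is less than the threshold P.
    return exceedance_probability(training_scores, candidate_score) >= p_threshold


# Hypothetical scores of training segments against the speaker's own codebook.
segment_scores = [0.2, 0.3, 0.4, 0.5]
p_mid = exceedance_probability(segment_scores, 0.35)
accepted = accept_by_probability(segment_scores, 0.35, p_threshold=0.1)
rejected = accept_by_probability(segment_scores, 0.60, p_threshold=0.1)
```

With only a handful of segment scores the empirical distribution is coarse, which is why the training signal must be segmented finely enough to yield many scores.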

3.8 Summary of Approach

In  Section  3  we  provided  an  architectural  and  functional  description  of  the 
main  speaker  identification  system  we  investigated  in  this  effort.  We 
investigated several other parameters, classifiers, and scoring approaches that we
did not include in this final report because we felt they would distract from the
discussion  of  the  best  approaches  for  developing  an  operational  speaker 
identification  system.  All  our  work  is  fully  documented  in  the  four  status  reports 
we  delivered  to  Rome  Laboratory. 

4.0  Results 


4.1  Introduction 

As  we  previously  stated,  we  used  two  databases  to  test  the  performance  of 
the  speaker  identification  system.  Since  the  results  of  several  speaker 
identification  studies  are  available  that  used  the  KING  database,  we  first 
performed  KING  database  tests  to  make  meaningful  comparisons  and  to  baseline 
our  system.  Although  the  KING  database  is  not  comprised  of  operational 
communications,  it  is  made  of  long  distance  telephone  communications  over  a 
variety  of  channel  conditions.  This  makes  the  database  very  useful  in  speaker 
identification  research  for  it  provides  an  environment  rich  in  channel  diversity. 
Once  we  determined  the  baseline  performance  of  our  system,  we  tested  the 
algorithms  with  Rome  Laboratory's  GREENFLAG  database.  The  GREENFLAG 
database is comprised of tactical USAF communications during Green Flag
exercises  at  Nellis  AFB.  The  GREENFLAG  database,  therefore,  is  the  database  of 
choice  for  USAF  operational  speaker  identification  R&D  efforts.  In  sections  4.2 
and  4.3  we  summarize  the  results  from  our  KING  and  GREENFLAG  database 
experiments. 

4.2 KING Database Results

4.2.1  KING  Database  Description.  The  KING  database  is  comprised  of  long 
distance  telephone  recordings  of  50  speakers.  There  are  a  total  of  10  sessions  in 
which  all  50  speakers  are  represented.  The  sessions  average  approximately  40 
seconds  in  length.  The  database  is  rich  in  channel  variability.  Sessions  1-5  and 
sessions  6-10  were  recorded  using  two  different  recording  setups  respectively. 
Furthermore, 26 speakers were recorded at a site in San Diego, California and the
other  24  speakers  were  recorded  at  a  site  in  Nutley,  New  Jersey.  The  Nutley 
recordings  are  much  noisier  than  the  San  Diego  recordings.  Also,  to  capture 
variability  in  each  speaker's  voice,  the  different  sessions  were  recorded  over 
several  months. 

4.2.2 How the KING Database is Used. The KING database is typically used in
strategic speaker identification research. In strategic speaker identification, target
speakers are modeled for long term identification purposes regardless of the
communications medium. The usual approach in this type of speaker id is to
model the speaker in as many different channel conditions as possible. In order
to produce these general speaker models, many seconds or even minutes of
speech data are required. Following this approach, most investigators who use
the KING database use three sessions for training (or nearly two minutes
of data), and use two other sessions for testing. We followed the same concept
of operations (CONOPS), or procedure, in our KING experiments. Specifically,
we used the 26-speaker San Diego group for most of our tests. We used sessions
1-3  for  training  and  sessions  4  and  5  for  testing.  We  conducted  only  closed  set 
experiments  with  the  KING  database. 

4.2.3 Results with KING Database. We found that the LPC based cepstral
parameters  significantly  outperformed  the  PLP  based  cepstral  parameters.  The 
best  combination  of  PLP  parameters  produced  88.5%  recognition  and  the  best 
combination  of  LPC  parameters  produced  100%  recognition  (see  May,  July,  and 
November  Status  Reports  for  details).  In  Table  3,  we  summarize  the  results  we 
obtained  using  several  combinations  of  LPC  cepstral  parameters  and  the 
classifier  adjudication  process  we  described  in  Section  3.  We  show  the  results  of 
only  a  subset  of  every  possible  combination  of  parameters  shown  in  Table  1. 

Table  3  is  representative  of  the  results  we  obtained  in  most  KING 
experiments.  We  obtained  the  best  results  by  using  both  parameter 
normalization procedures, liftering and RASTA filtering, together with the delta
cepstrum extension. We also found that the acceleration cepstrum did not
provide as much improvement in performance as the delta cepstrum. In
Appendix 7 we include the confusion matrix outputs of selected KING database
performance runs.

Summary of KING Results

Parameter Combination                        Percent Correct
-----------------------------------------    ---------------
liftered cepstrum                             76.9 %
RASTA-liftered cepstrum                       88.5 %
delta cepstrum                                92.3 %
acceleration cepstrum                         76.9 %
liftered cepstrum & delta cepstrum            84.6 %
liftered cepstrum & acceleration cepstrum     84.6 %
RASTA-liftered cepstrum & delta              100.0 %
RASTA-liftered cepstrum & acceleration        96.2 %
liftered cepstrum, delta, & acceleration      84.6 %

Table 3


4.3  GREENFLAG  Database  Results 

4.3.1  GREENFLAG  Database  Description.  The  Rome  Laboratory  GREENFLAG 
database  is  a  collection  of  100  hours  of  off-the-air  tactical  communications  related 
to  takeoffs  and  landings.  For  the  experiments  described  here,  transmissions  from 
41  speakers  and  8  aircraft  were  used.  The  following  table  identifies  the  number 
of  speakers  associated  with  each  aircraft  type. 


Number of Speakers in Aircraft Type

Number of Speakers    Platform Type
------------------    -------------
 4                    A10
 1                    B52
 1                    C130
 1                    EA6
 4                    F4G
 9                    F15
11                    F16
 3                    F111
 1                    F117
 2                    RF4C
 4                    Towers

Table 4


The number of transmissions per speaker ranges from 3 to 12. The
transmissions  are  also  very  short,  ranging  from  less  than  a  second  to  eight 
seconds,  and  have  a  wide  range  and  level  of  background  aircraft  noise.  We 
arbitrarily  used  the  first  two  transmissions  for  training  and  all  subsequent 
transmissions  for  testing.  We  regard  the  two  concatenated  training  transmissions 
as  a  single  transmission.  In  all  we  used  41  transmissions  for  training  and  159 
transmissions  for  testing.  The  following  tables  provide  training  and  testing 
transmission  lengths  and  statistics. 


Training Transmission Length Statistics For 41 Transmissions

stat    seconds
----    -------
max     8.69
min     2.10
mean    4.24

Table 5


Training Transmission Lengths

Transmission Length (tl)    Number of Transmissions
------------------------    -----------------------
2.0 < tl < 3.0               6
3.0 < tl < 4.0              15
4.0 < tl < 5.0              12
5.0 < tl < 6.0               3
6.0 < tl < 7.0               3
7.0 < tl < 8.0               0
8.0 < tl < 9.0               2
Total                       41

Table 6

Test Transmission Length Statistics For 159 Transmissions

stat    seconds
----    -------
max     6.75
min     0.36
mean    2.35

Table 7


Test Transmission Lengths

Transmission Length (tl)    Number of Transmissions
------------------------    -----------------------
0.0 < tl < 1.0               20
1.0 < tl < 2.0               77
2.0 < tl < 3.0               39
3.0 < tl < 4.0               13
4.0 < tl < 5.0                7
5.0 < tl < 6.0                2
6.0 < tl < 7.0                1
Total                       159

Table 8


4.3.2  GREENFLAG  Database  and  Tactical  CONOPS.  In  our  discussion  of  the 
KING  database  and  its  use,  we  introduced  the  term  CONOPS  to  define  how 
speaker  identification  is  conducted  for  strategic  purposes.  The  CONOPS  for 
tactical  speaker  id  is  very  different.  In  a  tactical  situation,  on-line  speaker 
training  may  be  required  with  very  limited  data  and  subsequent  identification 
may  only  be  necessary  during  a  particular  mission.  For  example,  an  operator 
may  want  to  track  pilots  on  a  mission-by-mission  basis.  With  this  CONOPS  the 
background noise and channel may actually help distinguish the different pilots,
especially if they fly different aircraft. In this case, parameter normalization
techniques for reducing channel effects may reduce identification performance,
particularly when very little data is used for training. The results we show in the
next section support our conjecture.

4.3.3 Closed-Set GREENFLAG Results. In Table 9 we show the closed-set
results obtained from the 159 test transmissions. Note that RASTA filtering
degraded overall performance. This result is consistent with the putative
channel normalizing effects of RASTA filtering. Since we assume the acoustic
background in the transmissions is helping to separate the different speakers,
one would expect that channel normalization would degrade performance.
Again we find that the acceleration cepstrum either reduced performance or
provided no significant gain.

The  effects  of  liftering  are  very  interesting.  We  found  in  our  GREENFLAG 
tests,  that  by  themselves,  the  cepstrum  and  the  liftered  cepstrum  perform  about 
the  same.  As  shown  in  Table  9,  the  91.8%  recognition  (for  the  cepstrum)  versus 
the  91.2  %  recognition  (for  the  liftered  cepstrum)  represents  a  difference  of  only 
one  misclassified  transmission.  However,  in  combination  with  the  delta 
cepstrum,  their  effects  on  performance  are  significant.  With  the  basic  cepstrum 
combination only three more transmissions were correctly classified. With the
liftered  cepstrum  combination  seven  more  transmissions  were  correctly 
classified.  We  include  the  transmission-by-transmission  results  in  Appendix  7. 
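
The transmission counts quoted above can be recovered from the percentages reported in Table 9 and the 159 test transmissions. The short check below is only an arithmetic illustration; the variable names are ours.

```python
TOTAL_TEST_TRANSMISSIONS = 159

# Percent correct as reported in Table 9.
percent_correct = {
    "cepstrum": 91.8,
    "liftered cepstrum": 91.2,
    "cepstrum & delta cepstrum": 93.7,
    "liftered cepstrum & delta cepstrum": 95.6,
}

# Convert each percentage to a whole number of correctly classified transmissions.
correct = {name: round(TOTAL_TEST_TRANSMISSIONS * pct / 100.0)
           for name, pct in percent_correct.items()}

gain_basic = correct["cepstrum & delta cepstrum"] - correct["cepstrum"]
gain_liftered = (correct["liftered cepstrum & delta cepstrum"]
                 - correct["liftered cepstrum"])
```

The counts come out to 146 and 145 transmissions for the cepstrum and liftered cepstrum alone, and 149 and 152 when each is combined with the delta cepstrum, which gives the gains of three and seven transmissions.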


Closed-Set Identification Results

Parameter Combination                        Percent Correct
-----------------------------------------    ---------------
cepstrum                                      91.8 %
liftered cepstrum                             91.2 %
delta cepstrum                                66.7 %
acceleration cepstrum                         32.1 %
RASTA-liftered cepstrum                       63.5 %
cepstrum & delta cepstrum                     93.7 %
cepstrum & acceleration cepstrum              89.9 %
liftered cepstrum & delta cepstrum            95.6 %
liftered cepstrum & acceleration cepstrum     91.8 %
RASTA-liftered cepstrum & delta               74.2 %
RASTA-liftered cepstrum & acceleration        61.0 %

Table 9

4.3.4 Open-Set GREENFLAG Results. In this section we present our open-set
results. We did not conduct any open-set experiments with the probability
method. We decided to test a different classification scheme that uses the
statistics of the output scores (see Appendix 5) rather than the scores themselves.
The best result we achieved using this method was 88.5%. We hypothesize that
this lower performance resulted from non-representative cumulative
distributions due to insufficient data. For this reason we abandoned the
probability metric for open-set detection in GREENFLAG database tests.

With  the  open-set  problem,  it  was  not  clear  how  to  best  combine  the  results 
obtained  from  the  different  parameters.  Our  first  approach  was  to  reject  the 
candidate  speaker  only  when  both  cepstral  and  delta  cepstral  scores  failed  the 
threshold criteria. We found that with this approach, the delta cepstrum did not
provide any additional performance improvement. This is evident in the operating
curves shown in Figures 5 and 6. These operating curves were obtained using the
N  Closest  Speakers'  Scores  method  described  in  Section  3.7.2.  For  the  Global 
Threshold  and  Cohort  Threshold  methods,  we  show  the  operating  curves 
obtained  from  the  cepstrum  alone  in  Figures  7-10. 

The operating curves in all cases were obtained by treating each speaker in turn
as the out-of-set speaker. All test transmissions were then processed through the
identification  system  in  each  iteration.  Therefore,  for  each  data  point,  6519  (41 
times  159)  transmissions  were  processed. 
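
The rotation can be sketched as follows to show where the 6519 trials per data point come from. The generator name and the assignment of test transmissions to speakers are hypothetical (the per-speaker counts are not listed in this report), and we assume that a transmission counts as out-of-set only in the iteration where its own speaker is held out.

```python
def open_set_rotation(speakers, transmission_owners):
    """Yield one trial per (held-out speaker, test transmission) pair."""
    for held_out in speakers:
        for index, owner in enumerate(transmission_owners):
            # The transmission is out-of-set only when its speaker is held out.
            yield held_out, index, owner == held_out


speakers = [f"spk{i:02d}" for i in range(41)]
# Hypothetical assignment of the 159 test transmissions to the 41 speakers.
owners = [speakers[i % 41] for i in range(159)]

trials = list(open_set_rotation(speakers, owners))
n_trials = len(trials)
n_out_of_set = sum(1 for _, _, is_out in trials if is_out)
```

Under this reading, each test transmission is an out-of-set trial exactly once, so 159 of the 6519 trials are out-of-set regardless of how the transmissions are distributed among speakers.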

It  is  clear  from  Figures  5-10  that  the  three  out-of-set  metrics  produced  similar 
results.  We  suggest  two  methods  to  improve  performance.  The  first  method  is 
to  find  a  candidate  speaker  by  using  the  parameter  fusion  process  we  described 
in  Section  3.5.  Afterwards,  use  only  the  cepstral  parameter  for  the  out-of-set  test. 
Although  the  false  alarm  rate  will  not  improve,  the  detection  rate  should  increase 
several  percentage  points  (see  Section  4.3.3).  The  second  suggestion  is  to  use  a 
separate  set  of  training  data  to  find  individual  MSE  thresholds  or  cohort 
normalized score thresholds. The single point around the 19%-82% false
alarm-detection rate in Figure 9 was produced by arbitrarily using portions of each
speaker's  test  transmissions  to  find  individual  cohort  normalized  score 
thresholds.  Subsequently,  we  processed  the  rest  of  the  test  transmissions  in  the 
usual  way.  With  this  method,  we  improved  the  false  alarm  rate  from  48%  to  19% 
at  the  82%  detection  rate. 

Figure 5. False Alarm vs. Detection Rate from N Closest Speakers' Scores Thresholds

Figure 6. Recognition Given Detection Rate from N Closest Speakers' Scores Thresholds

[Plots not reproduced. Recoverable axis labels: detection rate (%), recognition rate given detection (%).]


4.4  Summary  of  Results 

In  summary,  we  attained  very  respectable  results  in  this  effort.  In  closed-set 
experiments using the 26-speaker San Diego subset of the KING database,
we attained 100% recognition. This recognition rate was obtained by fusing the
RASTA-liftered  cepstrum  with  the  delta  cepstrum  using  the  codebook 
adjudication  process  described  in  Section  3.  The  scenario  simulated  in  the  KING 
experiments  is  typical  of  a  strategic  speaker  identification  application. 

In  a  simulated  tactical  environment  using  the  GREENFLAG  database,  we 
attained  95.6%  recognition  of  41  pilots  and  air  traffic  controllers  from  159  test 
transmissions  in  closed-set  tests.  We  obtained  this  performance  rate  using  an 
average  transmission  length  of  only  4.2  seconds  for  training  and  an  average 
transmission  length  of  2.4  seconds  for  testing.  In  GREENFLAG  open-set 
experiments,  the  best  false  alarm  rate  we  obtained  was  19%  at  a  detection  rate  of 
82%.  The  recognition  given  detection  rate  was  nearly  99%  at  the  19%  false  alarm 
rate.  In  all  GREENFLAG  experiments  we  found  that  the  liftered  cepstrum  in 
conjunction  with  the  delta  cepstrum  provided  the  best  results. 

5.0  Conclusion 


5.1  Lessons  Learned 

In  this  section  we  briefly  discuss  the  most  important  lessons  we  learned 
during  the  course  of  this  effort.  We  present  these  lessons  in  the  four  categories 
highlighted  in  Sections  5.1.1-5.1.4  that  follow. 

5.1.1  Parameters,  Parameter  Normalization,  and  Channel  Effects.  The  parameter 
normalization  requirements  for  high  performance  speaker  identification  depend 
on  the  expected  communications  channel  conditions.  The  expected  channel 
conditions are in turn dependent on the application's CONOPS. With a typical
strategic CONOPS, where target speakers are modeled for long term
identification in a variety of communications channels, parameter normalization
is necessary for better performance. Different communications channels produce
different distortions in the speech parameters. Normalization suppresses
these distortions and makes it easier for the classifier to correctly match acoustic
features.

On the other hand, with a tactical CONOPS, in which target speakers are
trained on-the-fly for subsequent mission tracking, the channel and audio
background actually help the discrimination process. This occurs because the
communications channel tends to be fairly constant and quite different from
speaker to speaker. Because of this, channel normalization should not be
performed.

We  now  turn  our  attention  to  the  parameter  and  normalization  details 
themselves.  We  found  that  in  all  cases  the  LPC  cepstra  produced  better  results 
than the PLP cepstra. We also found that the acceleration cepstrum actually hurt
performance in some cases, and at best provided little performance
improvement. The delta cepstrum, however, significantly improved
performance when used in conjunction with one of the cepstral parameters.
Although liftering is sometimes considered a parameter normalization technique,
we found that it produced better results in the tactical GREENFLAG database
experiments as well. We do not believe that liftering attenuates channel induced
distortions.  Instead  it  seems  to  enhance  cepstral  features  that  are  important  for 
speaker  discrimination  (see  [6],  pp.  377-379  for  a  discussion  on  the  effects  of 
cepstral  liftering).  RASTA  filtering,  on  the  other  hand,  appears  to  decrease 
channel  effects  as  advertised. 

All  this  is  good  news,  for  we  can  reduce  our  potential  list  of  parameters  to 
three:  the  liftered  cepstrum,  the  delta  cepstrum,  and  the  RASTA-liftered 
cepstrum.  Our  parameter  choice  decisions  are  now  also  quite  simple.  In  a 
strategic  speaker  id  environment,  use  the  RASTA-liftered  cepstrum  with  the 
delta cepstrum; in a tactical speaker id environment, use the liftered cepstrum
with  the  delta  cepstrum. 

5.1.2 Classifiers. We found that the VQ performance was as good as or better
than  the  performance  of  the  two  backpropagation  networks.  This  was  one  of  the 
reasons  why  we  discounted  the  networks  early  in  the  program.  The  other  major 
and  more  important  reason  was  that  it  took  an  inordinate  amount  of  training 
time  (several  days)  for  the  networks  to  perform  well.  The  VQ  classifier,  on  the 
other  hand,  trains  in  minutes,  not  hours  or  days.  However,  this  does  not  mean 
that  backpropagation  networks  should  not  be  considered  for  operational  speaker 
identification  systems.  Backpropagation  networks  may  still  play  an  important 
role  in  strategic  systems  where  long  training  times  could  be  tolerated. 


5.1.3 Open Set Issues. The main lesson we learned during our open set
experiments was that the best false alarm-detection rate performance can be
achieved by having a second set of training data available for setting individual
speaker score thresholds. Without the second set of training data, we can expect
at best a 27% false alarm rate at 80% detection. With a second training set we can
expect  at  least  a  10%  reduction  in  the  false  alarm  rate.  In  all  cases  the  recognition 
rate  given  detection  is  greater  than  or  equal  to  the  closed  set  recognition  rate.  We 
claim  these  results  constitute  a  worst  case  scenario. 

We  also  found  that  we  can  increase  the  detection  rate  by  sticking  to  the 
original  speaker  identification  procedure.  In  other  words,  first  follow  the 
parameter  fusion  process  as  in  the  closed  set  procedure,  then  test  for  out-of-set 
with  only  the  cepstral  parameter  threshold. 

5.1.4  Training  Data  Requirements.  The  amount  of  training  data  required 
depends  on  the  speaker  identification  CONOPS,  the  channel  conditions  expected, 
and  the  open  set  requirements;  all  issues  discussed  in  the  previous  subsections. 
Not  surprisingly,  we  found  that  in  all  cases  the  more  training  data  available  the 
better.  In  a  strategic  scenario,  performance  is  greatly  improved  when  more 
training  data  is  available  from  various  communications  channels.  Even  when 
parameter  normalization  is  used,  it  is  important  to  model  as  many  channel 
variations as possible. In a tactical environment, at least two sets of training data
are required for good open set identification. Perhaps the most important lesson
we  learned  in  this  effort  was  to  use  whatever  data  is  available.  Even  with  as  little 
as  a  few  seconds  of  training  data  in  a  tactical  scenario,  high  performance  speaker 
identification  is  possible. 

5.2  Outstanding  Issues 

As  in  any  research  effort,  new  questions  arose  and  a  few  issues  remained 
unresolved  in  this  effort.  We  present  these  issues  in  the  following  list: 

a)  We  do  not  know  how  our  system  will  perform  on  a  large  population  (over 
100)  of  speakers.  It  is  also  not  clear  whether  large  population  speaker 
identification  is  really  needed.  It  is  possible  that  any  large  population  application 
could  be  broken  down  into  categories  in  order  to  reduce  the  set  of  target 
speakers. 

b) We did not investigate signal segmentation methods for isolating each
speaker's transmissions. We assumed perfect segmentation in our experiments.
One  approach  to  this  problem  is  to  make  speaker  identification  decisions  within 
a  one  to  two  second  sliding  window.  In  this  way  speaker  transitions  can  be 
detected.  We  have  seen  evidence  that  supports  this  approach.  Our  speaker 
identification  demonstration  system  reports  intermediary  individual  frame 
results  every  300  msec.  These  intermediary  results  almost  always  match  the  final 
decision  made  at  the  end  of  the  transmission.  Because  of  this  we  are  confident 
that  a  one  to  two  second  window  is  sufficient  for  high  performance 
identification.  We  can  also  perform  onset  and  offset  key-click  detection  to 
segment entire transmissions if these key clicks are available in the transmissions.

c)  Notwithstanding  the  LPC  cepstrum's  success  in  this  study,  there  are 
several  other  speech  modeling  techniques  that  can  be  studied.  We  previously 
mentioned  that  the  mel-cepstrum  performed  better  than  the  LPC  cepstrum  in 
continuous  speech  recognition  experiments  [3].  Therefore,  the  mel-cepstrum  is  a 
good  candidate  for  study.  There  are  also  completely  different  approaches,  such 
as  higher-order  spectral  modeling  [9]  and  wavelet  analysis  [10].  These  methods 
may  be  better  able  to  model  and  separate  out  noise  from  the  acoustics  of  speech. 

d)  RASTA  filtering  was  the  only  true  parameter  normalization  technique  we 
studied in this effort. There are other techniques available, such as mean
normalization and blind deconvolution, that can also reduce channel effects.
Other techniques such as Principal Components Analysis for feature extraction
and discriminant analysis [11] have the potential to isolate the cepstral
components that best separate the speakers. These components could also be less
susceptible to channel noise [4].

e)  Other  classifiers  need  to  be  studied  in  order  to  determine  whether 
classifier  fusion  will  improve  performance.  A  good  candidate  is  the  Ramping 
Adaptive Vigilance Network, or RAVN, developed by Booz Allen in another
USAF program. RAVN has a built-in out-of-set detection mechanism that could
be  useful  for  speaker  identification.  If  different  classifiers  are  studied,  other 
fusion  techniques  have  to  be  investigated  as  well. 

f)  The  out-of-set  metrics  we  investigated  were  not  tested  in  a  strategic 
environment.  We  did  not  have  the  time  to  perform  open  set  tests  with  the  KING 
database. However, others have reported excellent open set speaker
identification results in strategic-like experiments using the cohort normalization
technique [12].
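
The sliding-window idea raised in item b) can be sketched as follows. We assume a majority vote over the most recent intermediate decisions, with five decisions at 300 msec each spanning about 1.5 seconds; the voting rule, the window size, and the function names are illustrative choices and were not evaluated in this effort.

```python
from collections import Counter, deque


def sliding_window_decisions(frame_ids, window=5):
    """Majority vote over the last `window` intermediate decisions."""
    recent = deque(maxlen=window)
    decisions = []
    for frame_id in frame_ids:
        recent.append(frame_id)
        decisions.append(Counter(recent).most_common(1)[0][0])
    return decisions


def speaker_transitions(decisions):
    """Indices at which the windowed decision changes from one speaker to another."""
    return [i for i in range(1, len(decisions))
            if decisions[i] != decisions[i - 1]]


# Hypothetical stream of intermediate results: speaker A, then speaker B.
frame_ids = ["A"] * 6 + ["B"] * 6
decisions = sliding_window_decisions(frame_ids)
transitions = speaker_transitions(decisions)
```

In this toy stream the windowed decision switches to the second speaker once that speaker holds three of the five most recent decisions, that is, two decisions after the true change.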

5.3  Recommendations 

In  our  opinion,  speaker  identification  technology  and  digital  signal 
processing  technology  are  at  the  stage  where  portable  operational  speaker 
identification  systems  can  be  developed  and  fielded.  We  believe  our  findings  in 
this  effort  support  this  claim.  In  this  final  section  of  our  report  we  provide  our 
recommendations  for  how  to  proceed  with  a  systems  development  program. 

The  first  step  is  to  define  the  system  requirements.  At  a  minimum,  we 
suggest  the  following  requirements.  The  proposed  system  should  be  flexible 
enough  to  handle  both  tactical  and  strategic  CONOPS.  The  system  should  also 
be able to transition from tactical to strategic modes of operation. On-line
real-time speaker training should be a required capability. New speakers should be
trained  without  the  need  to  retrain  all  other  speakers. 

The  algorithms  we  discussed  in  this  report  can  meet  these  requirements. 
Therefore, we suggest that the algorithms be immediately ported to the
appropriate  digital  signal  processing  platform  to  develop  an  initial  prototype. 
We  are  confident  that  the  initial  prototype  could  perform  well  on  its  own. 
Improvements  can  be  made  by  addressing  the  outstanding  issues  we  presented 
in  the  previous  section.  The  prototype  development  and  the  additional  research 
can  be  conducted  simultaneously.  As  improvements  are  found  in  the  research, 
they  can  be  incorporated  into  the  working  prototype.  In  this  way  a  working 
prototype  can  be  ready  for  testing  in  the  first  few  months  of  the  effort.  Any 
improvements  can  be  added  throughout  the  life  of  the  program. 

In  our  opinion,  one  of  the  most  important  requirements  we  suggested  was  the 
capability  to  transition  from  a  tactical  to  a  strategic  environment.  A  typical 
scenario  follows.  Suppose  the  mission  is  to  track  the  communications  of  a  group 
of  target  speakers  over  some  unspecified  period  of  time.  Furthermore,  suppose 
no  training  data  is  available  prior  to  the  first  mission.  The  key  is  to  have  a  two 
part  system  combined  into  one.  The  first  part  is  comprised  of  a  tactical  system 
that is cleared and retrained on-line each day. The second part is the strategic
system, which is trained off-line but is used daily in performance mode with the
tactical  system.  The  system  details  are  described  in  the  next  paragraph. 

The tactical system is comprised of classifiers that process the liftered
cepstrum  and  delta  cepstrum  of  each  speaker.  The  strategic  system  is  comprised 
of  classifiers  that  process  the  RASTA-liftered  cepstrum  (or  other  channel 
normalizing  parameters)  and  the  delta  cepstrum.  On  each  mission,  the  operator 
will  train  the  tactical  system  with  the  first  few  transmissions  of  each  target 
speaker.  The  training  is  done  on-line,  one  speaker  at  a  time.  The  trained 
speaker's  classifiers  can  be  placed  in  performance  mode  immediately  after 
training.  Therefore,  the  operator  has  the  capability  to  track  and  train 
simultaneously.  As  the  system  identifies  speakers  during  performance,  the 
identified  speakers'  transmissions  (up  to  a  predefined  number)  are  recorded  for 
later  use.  At  the  end  of  the  mission,  analysts  can  determine  which  speakers  were 
correctly  identified.  Their  corresponding  transmissions  can  then  be  used  to  train 
or  retrain  the  strategic  system.  The  tactical  system  is  cleared  after  each  mission 
and  the  process  is  repeated  on  the  following  mission.  After  the  initial  mission, 
the  strategic  system  is  used  in  performance  mode  along  with  the  tactical  system. 

This  process  can  be  repeated  for  several  missions  until  the  analysts  are 
satisfied that enough channel variability has been encountered and the strategic
system  is  performing  satisfactorily.  After  this  initial  set  of  missions  -  we  will  call 
it  the  Priming  Phase  -  the  tactical  system  need  not  be  used  again  and  the  strategic 
system  can  take  over.  After  the  Priming  Phase  no  human  intervention  will  be 
necessary  during  the  subsequent  missions,  which  we  will  call  the  Automatic 
Phase.  After  each  mission  in  the  Automatic  Phase,  the  analysts  can  perform  the 
same  job  as  before  but  on  the  strategic  system.  The  analysts  will  decide  whether 
or  not  to  retrain  the  strategic  system  after  each  mission. 

There  could  be  several  variations  to  this  theme.  For  example,  the  tactical 
system  could  be  tuned  for  minimal  human  interaction.  During  each  mission  in 
the  Priming  Phase,  the  operator  may  train  the  tactical  system  with  only  two  or 
three  speakers  then  allow  the  system  to  take  over  automatically.  Every  speaker 
the  tactical  system  rejects  as  a  target  speaker  can  be  trained  automatically.  In  this 
way  all  speakers  encountered  during  each  mission  will  be  trained  by  the  tactical 
system.  The  onus  will  be  on  the  analysts  to  consolidate  the  desired  target 
speakers  and  their  transmissions  for  training  the  strategic  system. 

To  summarize,  we  believe  we  have  the  technology  at  hand  to  begin  an 
operational speaker identification development program immediately. The
technology can support high performance speaker identification of at least 50
speakers  in  a  wide  range  of  channel  conditions.  The  technology  can  support  both 
tactical  and  strategic  needs.  The  technology  can  also  support  real-time,  on-line 
training  and  real-time  identification  during  performance.  We  can  also  make 
improvements  to  this  impressive  list  of  capabilities  during  an  initial  prototype 
development program.


Appendix 1


The  LPC  Cepstrum 

Derivation  of  Linear  Predictive  Coefficients 

Given a sampled signal s(n), we want to estimate the signal at sample n from
the weighted sum of a number of previous samples:

$$s(n) = \sum_{i=1}^{p} a(i)\,s(n-i) + e(n) \qquad (1)$$

where e(n) denotes the error in the estimate. In vector notation, equation (1)
becomes

$$\mathbf{s} = \mathbf{S}\mathbf{a} + \mathbf{e} \qquad (2)$$

We want to find the linear predictive coefficients a such that the mean squared
error is minimized. In other words we want to minimize

$$E = \|\mathbf{e}\|^2 = \mathbf{e}^{*}\mathbf{e} \qquad (3)$$

where * denotes the conjugate transpose.

The easiest way to find the coefficients a is by proving that the mean squared
error is minimized when the error vector e is orthogonal to the samples S (i.e.,
E{s(n - i)e(n)} = S*e = 0, where E{·} is the expectation operator).

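Theorem

(a) The coefficients a that minimize E satisfy

$$\mathbf{S}^{*}\mathbf{e} = \mathbf{S}^{*}(\mathbf{s} - \mathbf{S}\mathbf{a}) = \mathbf{0} \qquad (4)$$

(b) The resultant minimum is given by

$$\|\mathbf{e}_{\min}\|^2 = \mathbf{e}^{*}\mathbf{s} \qquad (5)$$

Proof: Let a satisfy (4). Consider another set of weights b, and let f be the new
error, s - Sb. Therefore

$$\mathbf{f} = \mathbf{s} - \mathbf{S}\mathbf{a} + \mathbf{S}\mathbf{a} - \mathbf{S}\mathbf{b} = \mathbf{e} + \mathbf{S}(\mathbf{a}-\mathbf{b})$$

$$\begin{aligned}
\|\mathbf{f}\|^2 &= \|\mathbf{e} + \mathbf{S}(\mathbf{a}-\mathbf{b})\|^2 \\
&= [\mathbf{e} + \mathbf{S}(\mathbf{a}-\mathbf{b})]^{*}[\mathbf{e} + \mathbf{S}(\mathbf{a}-\mathbf{b})] \\
&= \mathbf{e}^{*}\mathbf{e} + \mathbf{e}^{*}\mathbf{S}(\mathbf{a}-\mathbf{b}) + (\mathbf{a}-\mathbf{b})^{*}\mathbf{S}^{*}\mathbf{e} + [\mathbf{S}(\mathbf{a}-\mathbf{b})]^{*}[\mathbf{S}(\mathbf{a}-\mathbf{b})]
\end{aligned}$$

The second and third terms are zero by (4). Thus

$$\|\mathbf{f}\|^2 = \|\mathbf{e}\|^2 + \|\mathbf{S}(\mathbf{a}-\mathbf{b})\|^2 \ge \|\mathbf{e}\|^2$$

with equality if a = b. This proves part (a). Now the error term is

$$\|\mathbf{e}\|^2 = \mathbf{e}^{*}\mathbf{e} = \mathbf{e}^{*}(\mathbf{s} - \mathbf{S}\mathbf{a}) = \mathbf{e}^{*}\mathbf{s} - \mathbf{e}^{*}\mathbf{S}\mathbf{a}$$

but by (4), e*S = 0, which proves part (b).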


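Now we can proceed to the solution of the LPCs by using the preceding
theorem as follows. First let us rewrite the error term from (1) as

$$e(n) = \sum_{i=0}^{p} a(i)\,s(n-i) \qquad (6)$$

where a(0) = 1 and with the understanding that the rest of the coefficients have
changed sign. Hence,

$$E\{s(n-j)e(n)\} = \sum_{n} s(n-j)\sum_{i=0}^{p} a(i)\,s(n-i) = \sum_{i=0}^{p} a(i)\sum_{n} s(n-i)\,s(n-j) = 0 \qquad (7)$$

Writing the inner sum as the autocorrelation $r_{i-j} = \sum_{n} s(n-i)\,s(n-j)$, (7) becomes

$$\sum_{i=0}^{p} a(i)\,r_{i-j} = 0, \quad j = 1, 2, \ldots, p \qquad (10)$$

and from Equation (5) the minimum error is given by

$$\sum_{i=0}^{p} a(i)\,r_{i} = E \qquad (11)$$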

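The LPCs can be found directly from Equation (10) using Gaussian
elimination; however, Levinson and Durbin (see [13]) found an iterative solution
by using (10) in matrix notation and adding (11) as an auxiliary equation in the
matrix. The resulting matrix becomes

$$\begin{bmatrix} r_0 & r_1 & \cdots & r_p \\ r_1 & r_0 & \cdots & r_{p-1} \\ \vdots & \vdots & \ddots & \vdots \\ r_p & r_{p-1} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} 1 \\ a(1) \\ \vdots \\ a(p) \end{bmatrix} = \begin{bmatrix} E \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (12)$$

There are only positive subscripts for r in (12) since for a real sequence s, $r_{-i} = r_i$.
The trick to the iterative solution came from the observation that matrix R is
both symmetric and Toeplitz.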

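Suppose we have solved (12) for p = 1. This means that we know $a_1(1)$ (the
subscript denotes the iteration number) for which

$$\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix} \begin{bmatrix} 1 \\ a_1(1) \end{bmatrix} = \begin{bmatrix} E_1 \\ 0 \end{bmatrix} \qquad (13)$$

and we want to solve

$$\begin{bmatrix} r_0 & r_1 & r_2 \\ r_1 & r_0 & r_1 \\ r_2 & r_1 & r_0 \end{bmatrix} \begin{bmatrix} 1 \\ a_2(1) \\ a_2(2) \end{bmatrix} = \begin{bmatrix} E_2 \\ 0 \\ 0 \end{bmatrix} \qquad (14)$$

Since R is Toeplitz and symmetric, (13) still holds when its vectors are
reversed, which is what allows (14) to be solved:

$$\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix} \begin{bmatrix} a_1(1) \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ E_1 \end{bmatrix} \qquad (15)$$

We can solve (14) by making the following substitution:

$$\begin{bmatrix} 1 \\ a_2(1) \\ a_2(2) \end{bmatrix} = \begin{bmatrix} 1 \\ a_1(1) \\ 0 \end{bmatrix} + k_2 \begin{bmatrix} 0 \\ a_1(1) \\ 1 \end{bmatrix} \qquad (16)$$

where $k_2$ is some constant. Thus

$$\begin{bmatrix} r_0 & r_1 & r_2 \\ r_1 & r_0 & r_1 \\ r_2 & r_1 & r_0 \end{bmatrix} \left( \begin{bmatrix} 1 \\ a_1(1) \\ 0 \end{bmatrix} + k_2 \begin{bmatrix} 0 \\ a_1(1) \\ 1 \end{bmatrix} \right) = \begin{bmatrix} E_1 \\ 0 \\ q \end{bmatrix} + k_2 \begin{bmatrix} q \\ 0 \\ E_1 \end{bmatrix} \qquad (17)$$

where q is another term. Equation (17) can have a solution only if the right hand
side is all zeros under the partition. We can accomplish this by forcing $k_2$ to
satisfy

$$q + k_2 E_1 = 0$$

Therefore, we find that the solution to (14) is (16) where

$$k_2 = -\frac{1}{E_1}\sum_{i=0}^{1} a_1(i)\,r_{2-i} \qquad (18)$$

From (14) and (17) we find our new predictor error to be

$$E_2 = E_1 + k_2 q = E_1(1 - k_2^2) \qquad (19)$$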

We can finally generalize Equations (16), (18), and (19) for the nth iteration
and state the solution in the following iterative form:

1. For n = 0, $E_0 = r_0$

2. For n ≥ 1

a) $k_n = -\frac{1}{E_{n-1}}\sum_{i=0}^{n-1} a_{n-1}(i)\,r_{n-i}$

b) $a_n(n) = k_n$

c) For i = 1 to n - 1

$a_n(i) = a_{n-1}(i) + k_n\,a_{n-1}(n-i)$

d) $E_n = E_{n-1}(1 - k_n^2)$

This algorithm is known as the autocorrelation or Levinson and Durbin
method and was restated from [13]. There are other methods of finding the
LPCs, each with its strengths and weaknesses. The autocorrelation method is the
most popular in speech processing.
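
The iteration above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this effort: the function names `autocorrelation` and `levinson_durbin` are hypothetical, and the sketch follows the sign convention of Equation (6), so the returned vector starts with a(0) = 1.

```python
def autocorrelation(s, p):
    """Return r[0..p] with r[k] = sum_n s[n] * s[n - k] over one frame."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve the normal equations (12) for predictor order p.

    r : autocorrelation values r[0..p]
    Returns (a, E) where a[0] == 1 and E is the final prediction error.
    """
    a = [1.0] + [0.0] * p
    E = r[0]                                  # step 1: E_0 = r_0
    for n in range(1, p + 1):                 # step 2: n >= 1
        # (a) k_n from the order n-1 solution
        k = -sum(a[i] * r[n - i] for i in range(n)) / E
        prev = a[:]                           # keep a_{n-1} for the update
        a[n] = k                              # (b) a_n(n) = k_n
        for i in range(1, n):                 # (c) a_n(i) = a_{n-1}(i) + k_n a_{n-1}(n-i)
            a[i] = prev[i] + k * prev[n - i]
        E *= 1.0 - k * k                      # (d) E_n = E_{n-1}(1 - k_n^2)
    return a, E
```

A frame would be processed as `levinson_durbin(autocorrelation(frame, 14), 14)` for the 14th order analysis used in this report; the sketch does not guard against a zero prediction error.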

Derivation of Cepstral Coefficients from LPCs

The all-pole model of the vocal tract is defined as (see [6], [7], or [13])

$$H(z) = \frac{E}{A(z)} \qquad (20)$$

where H(z) and A(z) are the Z transforms of the impulse response h(n) of the
vocal tract filter and the predictor coefficients a(n), respectively. By definition, the
cepstrum of h(n) is

$$C(z) = \log H(z) \qquad (21)$$

$$C(z) = \log\left[\frac{E}{A(z)}\right] = \log E - \log A(z) \qquad (22)$$

Now take the derivative of both sides of (22) and multiply each side by z,

$$z\frac{dC(z)}{dz} = -z\frac{dA(z)}{dz}\frac{1}{A(z)} \qquad (23)$$

multiply by A(z),

$$z\frac{dC(z)}{dz}A(z) = -z\frac{dA(z)}{dz} \qquad (24)$$

and find the inverse Z transform

$$-n\,c(n) * a(n) = n\,a(n) \qquad (25)$$

We now expand the convolution on the left side and regroup some terms.

$$-\sum_{k=0}^{n}(n-k)\,c(n-k)\,a(k) = n\,a(n)$$

$$-n\,c(n) - \sum_{k=1}^{n}(n-k)\,c(n-k)\,a(k) = n\,a(n)$$

$$c(n) = -a(n) - \sum_{k=1}^{n}\frac{n-k}{n}\,c(n-k)\,a(k)$$

From (21), for n = 0, c(0) = log E, and since c(m) = 0 for m < 0 the final form of
the iterative equation for n > 0 becomes

$$c(n) = -a(n) - \sum_{k=1}^{n-1}\frac{n-k}{n}\,c(n-k)\,a(k) \qquad (26)$$
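
As a hedged illustration of the recursion in Equation (26), the sketch below converts predictor coefficients to cepstral coefficients. The function name `lpc_to_cepstrum` and its argument layout are assumptions made for this example; a(n) is taken to be zero beyond the predictor order, which is what lets the recursion produce more cepstral coefficients than LPCs.

```python
import math


def lpc_to_cepstrum(a, E, n_ceps):
    """Cepstrum c[0..n_ceps] from predictor coefficients a (a[0] == 1).

    a      : coefficients in the convention of Equation (6)
    E      : prediction error, giving c(0) = log E
    n_ceps : highest cepstral index to compute
    """
    p = len(a) - 1
    c = [math.log(E)] + [0.0] * n_ceps
    for n in range(1, n_ceps + 1):
        a_n = a[n] if n <= p else 0.0         # a(n) = 0 beyond the LPC order
        acc = 0.0
        for k in range(1, n):
            if k <= p:
                acc += (n - k) / n * c[n - k] * a[k]
        c[n] = -a_n - acc                     # Equation (26)
    return c
```

For the single-pole case A(z) = 1 - 0.5 z^-1 the result can be checked by hand, since -log A(z) expands to the series with terms (0.5^n / n) z^-n.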


The  Delta  Cepstrum 

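The derivative of each cepstral coefficient at frame n is approximated using
an N point linear regression as follows:

$$r = \frac{N-1}{2}, \quad N \text{ is odd} \qquad (27)$$

$$R = \sum_{m=-r}^{r} m^2 \qquad (28)$$

$$\Delta C_i(n) = \frac{\sum_{m=-r}^{r} m\,C_i(n-m)}{R} \qquad (29)$$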

The  Hamming  Window  Lifter 

To lifter the cepstrum c(n) of some frame of speech, perform a point-by-point
multiplication of c(n) with a lifter, or window, w(n). We obtained the best
performance from the Hamming window, which is defined as

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & |n| < N \\ 0, & \text{otherwise} \end{cases} \qquad (30)$$


Appendix 2


Perceptual  Linear  Prediction 

PLP  Model  and  Equations 

The  PLP  algorithm  is  an  auditory-like  model  of  the  auditory  periphery  that 
encompasses  some  psychophysical  concepts  of  hearing  such  as  logarithmically 
scaled  critical  band  filters,  equal  loudness  preemphasis,  and  the  intensity- 
loudness  power  law.  In  general,  the  PLP  technique  incorporates  these  three 
concepts  within  the  well  known  and  efficient  linear  predictive  all-pole  model  of 
speech.  A  full  discussion  of  the  PLP  process  and  its  theoretical  underpinnings 
are  found  in  [1]. 

The PLP process begins by finding the spectrum of each speech frame by
using the FFT. More precisely, given a windowed sampled speech signal frame
s(n),

$$S(k) = \sum_{n=0}^{N-1} s(n)\,e^{-j2\pi kn/N}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

$$P(k) = |S(k)|^2 \qquad (2)$$

where N is the number of points in the frame s(n). From Equation (1), we know
the frequency resolution $f_0$ of S by

$$f_0 = \frac{F_s}{N} \qquad (3)$$

where $F_s$ is the sampling frequency.

Bark scaled critical-band filter banks are then computed from the spectrum
and the critical-band filters. In general, the uniformly scaled frequency axis f of
the spectrum is warped into the logarithmically scaled Bark frequency axis ξ by

$$\xi(f) = 6\ln\left\{ f/600 + \left[(f/600)^2 + 1\right]^{0.5} \right\}. \qquad (4)$$

The warped spectrum is then convolved with the critical band filter. Hermansky
developed the following piece-wise implementation of the filter

$$X(\xi) = \begin{cases} 0 & \text{for } \xi < -1.3, \\ 10^{2.5(\xi+0.5)} & \text{for } -1.3 \le \xi \le -0.5, \\ 1 & \text{for } -0.5 < \xi < 0.5, \\ 10^{-1.0(\xi-0.5)} & \text{for } 0.5 \le \xi \le 2.5, \\ 0 & \text{for } \xi > 2.5. \end{cases} \qquad (5)$$

A more practical approach used by Hermansky is to design Bark scaled filter
bank functions in the uniform frequency f domain based on Equation (5) and the
inverse of Equation (4), which is

$$f = 600\sinh(\xi/6). \qquad (6)$$

This is done by determining the number of filters required to cover the entire
bandwidth of the signal in approximate unit Bark intervals. The bandwidth of
the signal in Bark is found by inserting the bandwidth of the signal in Hz in
Equation (4). For this investigation, we used 8 kHz sampled speech signals;
thus, the bandwidth in Bark is approximately 15.58. A total of 17 filters were
chosen, spaced at 0.9734 Bark. Now that the end points, in Bark, of each filter are
known, the end points in Hz can be computed using Equation (6). The closest
frequency index integers k are then determined for each filter end point. The
actual weights of each filter can now be computed by iterating through the
limits of each filter, finding the associated frequency f in Hz of each k,
computing the associated Bark frequency ξ, and finally finding the appropriate
weight value of the filter at that point in frequency by using Equation (5). One
must realize that the reflection of Equation (5) is actually used since X(ξ) is
convolved with the warped spectrum P(ξ), or

$$A(\lambda) = \sum_{\xi} P(\xi)\,X(\lambda - \xi). \qquad (7)$$
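
The arithmetic behind the filter layout can be reproduced with a short sketch of Equations (4), (5), and (6). The function names are hypothetical, and the sketch assumes the 17 filter centers are spaced evenly from 0 Bark to the band edge at 4 kHz, so that the spacing is the Bark bandwidth divided by 16; this assumption matches the quoted 0.9734 Bark spacing, but the report does not state it explicitly.

```python
import math


def hz_to_bark(f):
    """Equation (4): warp frequency in Hz onto the Bark axis."""
    return 6.0 * math.log(f / 600.0 + math.sqrt((f / 600.0) ** 2 + 1.0))


def bark_to_hz(xi):
    """Equation (6): inverse of the Bark warping."""
    return 600.0 * math.sinh(xi / 6.0)


def critical_band_weight(xi):
    """Equation (5): piece-wise critical band filter, xi in Bark."""
    if xi < -1.3 or xi > 2.5:
        return 0.0
    if xi <= -0.5:
        return 10.0 ** (2.5 * (xi + 0.5))
    if xi < 0.5:
        return 1.0
    return 10.0 ** (-1.0 * (xi - 0.5))


# 8 kHz sampling gives a 4 kHz signal bandwidth.
bandwidth_bark = hz_to_bark(4000.0)           # about 15.58 Bark
n_filters = 17
spacing = bandwidth_bark / (n_filters - 1)    # about 0.9734 Bark (assumed layout)
centers_hz = [bark_to_hz(i * spacing) for i in range(n_filters)]
```

The list `centers_hz` gives the assumed filter centers back on the uniform frequency axis, from which the nearest FFT indices k would be taken.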

To incorporate equal-loudness weighting of the spectrum, the critical-band
filters are preemphasized using the following equation

$$E(f) = \frac{(f^2 + 1200^2)\,f^4}{(f^2 + 400^2)^2\,(f^2 + 3100^2)}. \qquad (8)$$

Once the critical band filters are computed, the Bark filter bank representation of
the signal can be found by

$$A(i) = \sum_{k=k_a}^{k_b} w_i(k)\,P(k), \quad i = 1, 2, \ldots, I \qquad (9)$$

where I is the number of filter banks, $k_a$ is the lowest frequency index of the ith
filter, and $k_b$ is the high frequency index of the ith filter.

The last operation in the auditory-like process is to compress the Bark filter
bank spectrum with the intensity-loudness power law. Hermansky suggested a
cubic-root compression, producing

$$B(i) = A(i)^{0.33}, \quad i = 1, 2, \ldots, I. \qquad (10)$$

At this stage in the PLP process, the all-pole model of the signal can be
computed. We know that the autocorrelation function R(n) of s(n) can be found
by

$$R(n) = \mathcal{F}^{-1}\{S(k)S(k)^{*}\} = \mathcal{F}^{-1}\{|S(k)|^2\} = \mathcal{F}^{-1}\{P(k)\} \qquad (11)$$

where $\mathcal{F}^{-1}$ is the inverse FFT. Therefore, autocorrelation-like parameters can be
obtained from the inverse FFT of B(i), which was derived from P(k). Since we
want real valued autocorrelation-like coefficients, the cosine transform is used
instead of the FFT; thus

$$R(j) = \sum_{i=1}^{I} B(i)\cos(\cdots), \quad j = 1, 2, \ldots, J \qquad (12)$$

where J is the order of the all-pole or linear predictive model. The R(j) can now
be used in the autocorrelation method for finding the linear predictive
coefficients (LPCs) (see Appendix 1).


Appendix 3
RASTA  Filtering 


The  RASTA  Filter  Equations 


Hermansky [1] developed a simple technique for reducing channel induced
distortions in speech. He called the procedure the RelAtive SpecTrAl (RASTA)
methodology. The RASTA procedure bandpass filters each critical band filter
bank spectral channel B(i) through a filter with the following system function

$$H(z) = \frac{0.1\,(2.0 + 1.0z^{-1} - 1.0z^{-3} - 2.0z^{-4})}{z^{-4}\,(1 - 0.98z^{-1})} \qquad (1)$$

prior to the all-pole modeling step. In other words, the ith filtered channel at
frame n, $\hat{B}(i,n)$, is found by

$$\hat{B}(i,n) = \begin{cases} 0.2B(i,n) + 0.1B(i,n-1) - 0.1B(i,n-3) - 0.2B(i,n-4) + 0.98\hat{B}(i,n-1), & \text{for } n \ge 4 \\ 0, & \text{for } n < 4 \end{cases} \qquad (2)$$
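
A minimal sketch of the difference equation (2) for one spectral channel follows. The function name `rasta_filter` is hypothetical, and the feedback term is taken to be the previous filtered output, as the recursive denominator of Equation (1) requires.

```python
def rasta_filter(channel):
    """Apply the RASTA difference equation (2) to one channel.

    channel : list of B(i, n) values over frames n for a fixed band i
    Returns the filtered sequence, zero for the first four frames.
    """
    out = [0.0] * len(channel)
    for n in range(4, len(channel)):
        out[n] = (0.2 * channel[n]
                  + 0.1 * channel[n - 1]
                  - 0.1 * channel[n - 3]
                  - 0.2 * channel[n - 4]
                  + 0.98 * out[n - 1])        # feedback on the filtered output
    return out
```

Because the numerator coefficients sum to zero, a constant offset in a channel is removed, which is the property that suppresses a fixed channel coloration.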


The LPC cepstrum can also be filtered in the same way.


Appendix 4


LBG  Vector  Quantization 


Codebook  Training 

The Linde-Buzo-Gray (LBG) algorithm clusters an N dimensional training
data set into K codewords, or centroids, in N dimensional space, such that all
vectors in the training set can be quantized with minimum distortion. We use the
mean squared error (MSE) distortion measure in this research. The number of
clusters K is chosen by the user. The LBG algorithm begins by initializing one
codeword to the centroid of all input vectors. On each iteration of the algorithm,
each existing codeword is split in two by taking + or - 1% of each codeword's
elements. After the codebook is doubled in this manner, the codebook centroids
are adjusted using the LBG iteration. This process continues until the desired size
of the codebook is generated. If K ≠ 2^m for some m, then a point will be reached
when doubling the codebook will make it too large. At this point individual
codewords must be chosen for splitting. Two possible criteria are used for
splitting: a density criterion (the codeword that decodes the largest number of
training vectors is split) and a largest average MSE criterion (the codeword that
produces the largest average MSE over its set of training vectors is split).


LBG  Iteration 

A)  Initialize  Codebook:  As  described  above 

B)  Iteration: 

1. For each training vector x, quantize x into the codeword $\mathbf{x}_{k^{*}}$
such that

$$k^{*} = \arg\min_{k} \|\mathbf{x} - \mathbf{x}_k\|^2$$

2. Compute the total average distortion D produced
by each vector x and the new codebook:

$$D = \frac{1}{L}\sum_{i=1}^{L} \|\mathbf{x}_i - Q(\mathbf{x}_i)\|^2$$

where L is the number of training vectors and $Q(\mathbf{x}_i)$ is the codeword that
vector $\mathbf{x}_i$ was assigned to in step 1. If D has changed less than some
small amount, stop.

3. Find the new centroid of each codeword and go to step 1.
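
The splitting procedure and the LBG iteration can be sketched as follows. This is an illustrative outline rather than the implementation used in this effort: the function names, the stopping tolerance `tol`, and the iteration cap are assumptions, and the sketch only handles codebook sizes that are powers of two, so the density and largest average MSE splitting criteria are not shown.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg_iteration(data, codebook, tol=1e-6, max_iter=100):
    """Steps 1 to 3: assign, measure distortion D, move the centroids."""
    prev = float("inf")
    D = prev
    for _ in range(max_iter):
        # Step 1: nearest codeword index for every training vector
        assign = [min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))
                  for x in data]
        # Step 2: average distortion; stop when it no longer improves
        D = sum(sq_dist(x, codebook[k]) for x, k in zip(data, assign)) / len(data)
        if prev - D < tol:
            break
        prev = D
        # Step 3: replace each codeword by the centroid of its members
        for k in range(len(codebook)):
            members = [x for x, a in zip(data, assign) if a == k]
            if members:
                codebook[k] = centroid(members)
    return codebook, D


def train_codebook(data, K, eps=0.01):
    """Grow a codebook by repeated +/- 1% splitting; K must be a power of two."""
    if K < 1 or K & (K - 1):
        raise ValueError("this sketch only supports K = 2**m")
    codebook = [centroid(data)]
    while len(codebook) < K:
        split = []
        for c in codebook:
            split.append([v * (1.0 + eps) for v in c])
            split.append([v * (1.0 - eps) for v in c])
        codebook, _ = lbg_iteration(data, split)
    return codebook
```

Note that a codeword whose elements are all zero cannot be separated by a multiplicative split; a practical implementation would need an additive perturbation for that case.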
Appendix 5

Classification  Based  on  Score  Statistics 

In  section  3.4,  we  found  that  a  candidate  speaker  is  chosen  when  a  test 
transmission  produces  the  lowest  average  mean  squared  error  score  against  that 
speaker's codebook. We can also use the score statistics of the speakers for our
decision  making  process.  For  example,  we  can  compare  a  speaker's  test  score 
against  the  scores  that  speaker's  codebook  has  produced  in  the  past.  In  this  way 
we  can  make  a  decision  based  on  the  speaker's  score  relative  to  the  scores  his 
codebook  has  produced  for  his  speech  as  well  as  everyone  else's  speech  in  the 
target  group.  In  other  words,  we  want  to  find  the  probability  that  a  test 
transmission  was  generated  by  speaker  j  given  speaker  j 's  current  score  is  less 
than  what  his  codebook  has  previously  generated.  The  speaker  that  produces 
the  largest  probability  is  considered  the  candidate  speaker. 

More specifically, for a set of target speakers C, choose speaker k as
candidate if

$$k \ni P(S_k \mid E_k > e_k) = \max_{j} P(S_j \mid E_j > e_j) \quad \forall j \in C \qquad (1)$$

where $S_j$ denotes the jth speaker, $e_j$ is $S_j$'s classifier output score, and $E_j$ is the
set of scores previously produced by $S_j$. To find the desired probability we use
Bayes' rule:

$$P(S_j \mid E_j > e_j) = \frac{P(E_j > e_j \mid S_j)\,P(S_j)}{P(E_j > e_j)} \qquad (2)$$

We can find each term on the right hand side of (2) in a straightforward
manner. We know that the cumulative distribution function of some random
variable X is defined as

$$F_X(x) = P(X \le x) \qquad (3)$$

and

$$P(X > x) = 1 - P(X \le x) \qquad (4)$$

Therefore, $P(E_j > e_j \mid S_j)$ is one minus the cumulative distribution of scores
produced by encoding speaker j's previous transmissions - usually the training
transmissions - through speaker j's codebook. Similarly, $P(E_j > e_j)$ is one minus
the cumulative distribution of scores produced by encoding the training
transmissions of all speakers through speaker j's codebook. The probability of
encountering speaker j, $P(S_j)$, is easily found by dividing speaker j's training
signal length by the sum of the training signal lengths of all speakers in the
target set.
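
One way to read Equations (1) through (4) is with empirical distributions built from stored scores. The sketch below is an assumption-laden illustration: the function names, the dictionary layout, and the use of a simple empirical count in place of a fitted cumulative distribution are all choices made for this example.

```python
def prob_greater(scores, e):
    """Empirical P(E > e) = 1 - F_E(e) from a list of past scores."""
    return sum(1 for s in scores if s > e) / len(scores)


def choose_candidate(test_scores, own_scores, pooled_scores, priors):
    """Pick the speaker maximizing P(S_j | E_j > e_j) via Bayes' rule (2).

    test_scores   : {speaker: e_j}, current score from speaker j's codebook
    own_scores    : {speaker: scores of j's own training data on j's codebook}
    pooled_scores : {speaker: scores of all speakers' training data on j's codebook}
    priors        : {speaker: P(S_j)}
    """
    posterior = {}
    for j, e in test_scores.items():
        evidence = prob_greater(pooled_scores[j], e)
        if evidence == 0.0:
            posterior[j] = 0.0                # score worse than anything seen before
        else:
            posterior[j] = prob_greater(own_scores[j], e) * priors[j] / evidence
    best = max(posterior, key=posterior.get)
    return best, posterior
```

With only a few training transmissions per speaker the empirical counts are coarse, so a smoothed distribution may be preferable in practice.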

If all speakers are regarded as equally likely, then $P(E_j > e_j)$ must be
normalized to account for equally probable speakers. Equation (1) becomes:

$$k \ni \frac{P(E_k > e_k \mid S_k)}{\tilde{P}(E_k > e_k)} = \max_{j} \frac{P(E_j > e_j \mid S_j)}{\tilde{P}(E_j > e_j)} \quad \forall j \in C$$

where $\tilde{P}(E_j > e_j)$ is the appropriately normalized $P(E_j > e_j)$.


Appendix 6

Out-of-Set  Detection 


Definitions 

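$C = \{S_1, S_2, \ldots, S_K\}$ - Set of K trained speakers

$Q_k = \{\mathbf{q}_{k1}, \mathbf{q}_{k2}, \ldots, \mathbf{q}_{kN}\}$, $k = 1, 2, \ldots, K$ - Speaker k's codebook of N codewords

$\Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}$ - Test signal made up of M vectors

$\Phi_k$, $k = 1, 2, \ldots, K$ - Speaker k's training signal

$d(\Theta, Q_k) = \frac{1}{M}\sum_{i=1}^{M} \min_{j} \|\theta_i - \mathbf{q}_{kj}\|^2$, $j = 1, \ldots, N$ - Speaker k's codebook output score

$e_j = \min_{k} d(\Theta, Q_k)$, $k = 1, 2, \ldots, K$ - Candidate speaker j's score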

Global  MSE  Score  Threshold 

Given global threshold T and candidate speaker $S_j$'s output score $e_j$, if $e_j > T$
then reject $S_j$ as target speaker, otherwise accept $S_j$ as target speaker.

N  Closest  Speakers'  Scores  Threshold 

For each speaker j, test all other speakers' training signals against speaker j's
codebook and rank the resulting scores. In other words, find

$$d_i(\Phi_k, Q_j), \quad k \ne j, \quad i = 1, 2, \ldots, K \qquad (1)$$

where $d_1(\cdot)$ is the lowest score and $d_K(\cdot)$ is the highest score. Now find the
following threshold:

$$T_j = \frac{1}{N}\sum_{i=1}^{N} d_i(\Phi_k, Q_j) \qquad (2)$$

where N is predetermined by the user. If $e_j > T_j$ then reject $S_j$ as target speaker,
otherwise accept $S_j$ as target speaker.
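
The codebook score and the N closest speakers' scores threshold can be sketched as below. The names `codebook_score`, `n_closest_threshold`, and `identify` are hypothetical, codebooks and signals are plain lists of vectors, and the threshold is taken to be the mean of the N lowest scores that the other speakers' training signals produce on speaker j's codebook.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def codebook_score(signal, codebook):
    """d(signal, Q_k): average distance of each vector to its nearest codeword."""
    return sum(min(sq_dist(v, q) for q in codebook) for v in signal) / len(signal)


def n_closest_threshold(j, codebooks, training, n):
    """T_j: mean of the n lowest scores from other speakers on j's codebook."""
    scores = sorted(codebook_score(training[k], codebooks[j])
                    for k in codebooks if k != j)
    return sum(scores[:n]) / n


def identify(test_signal, codebooks, thresholds):
    """Return (candidate, score, accepted) for one test transmission."""
    scores = {k: codebook_score(test_signal, cb) for k, cb in codebooks.items()}
    j = min(scores, key=scores.get)           # candidate speaker, lowest score
    return j, scores[j], scores[j] <= thresholds[j]
```

Passing the same value for every entry of `thresholds` gives the global MSE score threshold described earlier in this appendix.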




Cohort  Normalized  Scores  Thresholds 

Find $S_j$'s cohorts by testing speaker j's training signal against all other speakers'
codebooks and rank the resulting scores in a similar fashion as in (1):

$$d_i(\Phi_j, Q_k), \quad k \ne j, \quad i = 1, 2, \ldots, K \qquad (3)$$

The N cohorts are the speakers that produced the N lowest scores. We will call
speaker j's cohorts group J. Now average the cohorts' scores:


^  1=1 


Finally,  upon  testing  if 


then  accept  Sj  as  target  speaker,  otherwise  reject  Sj. 


Probabilistic  Thresholds 

Refer to Appendix 5 for definitions. If $P(E_j > e_j \mid S_j) < T_j$ then reject $S_j$ as
target speaker, otherwise accept $S_j$.


Appendix  7 


Results  Print  Outs 
KING Database Results

This is the MSE ratios of 14th order:

1)  liftered  cepstrum 

2)  delta  cepstrum  (7  point  linear  regression) 

3)  acceleration  cepstrum  (7  point  linear  regression) 

4) RASTA and liftered cepstrum

wrong: 6 Percent Correct: 76.923

percent correct RASTA-liftered cepstrum = 63.522013
percent  correct  delta  cepstrum  =  66.666667 


Appendix  8 


Status  Reports 

1.0  Status  Report  1 

In  the  first  status  report  we  presented  the  results  of  the  preliminary 
experiments  we  conducted  with  the  Backpropagation  network  and  the  Real-Time 
Recurrent  Learning  (or  Recurrent  Backpropagation).  During  the  reporting 
period  covered  by  status  1  we  also  attempted  to  duplicate  the  results  that  Kao 
and  his  colleagues  had  recently  reported.  Those  results  were  some  of  the  best 
reported  as  of  the  beginning  of  this  effort.  We  also  investigated  different 
windowing (liftering) functions and conducted several experiments with
different  training  and  testing  intervals. 

1.1  Previous  Results 

This  section  describes  recent  results  in  speaker  identification  in  the  open 
literature. 

Rahim  (1992),  and  Burr  (1992)  have  reported  recognition  rates  of  over  98%  in 
48-50  speaker  experiments  using  wideband,  high  signal-to-noise  ratio  (SNR) 
speech  (TIMIT).  In  both  cases,  the  investigators  used  the  LPC  derived  cepstral 
representation  of  speech  as  input  parameters  to  a  vector  quantizer  as  the 
classifier.  However,  recognition  rates  drop  nearly  30  percentage  points  when 
narrowband telephone channel speech is tested (Burr, 1992). Burr found that by
including the delta cepstrum, he could regain approximately 15% in performance
for a total of about 85% recognition rate.

Kao et al. (1992) have conducted an intensive comparative analysis of several
front end features using a VQ classifier. Their baseline results were obtained
using 20 cepstral coefficients derived from 14th order LPC coefficients for each
voiced frame of speech. They used a 30 msec Hamming windowed frame at a 20
msec frame rate. The 50 speaker KING speech database was used in their
analysis. The KING database has several recording and channel induced classes.
There are a total of 10 sessions in which all 50 speakers are represented. Sessions
1-5 and sessions 6-10 were recorded using two different recording setups,
respectively. The two recording setups induced two distinct channel conditions
on the speech. Furthermore, 26 speakers were recorded at a site in San Diego,
CA and the other 24 speakers were recorded at a site in Nutley, NJ. The Nutley
recordings are much noisier than the San Diego recordings. The table below
describes the results using sessions 1-5.


               San Diego   Nutley    All
Baseline        81.73%     35.00%   58.82%
BPL             85.58%     47.00%   66.67%
RASTA           91.35%     50.00%   71.08%
BPL & RASTA     94.23%     61.00%   77.94%

Table 1


BPL stands for bandpass liftering. The lifter w(k) used is described as:

$$w(k) = 1 + h\sin(\pi k / L); \quad k = 1, 2, \ldots, L; \quad h = L/2 \qquad (1)$$

where L is the number of cepstral coefficients; in this case 20.
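
For illustration, the bandpass lifter of Equation (1) can be generated and applied as follows. The function names are hypothetical; the lifter weights are indexed k = 1 to L and multiplied point-by-point with the corresponding cepstral coefficients.

```python
import math


def bandpass_lifter(L):
    """Weights w(k) = 1 + (L/2) sin(pi k / L) for k = 1, ..., L."""
    h = L / 2.0
    return [1.0 + h * math.sin(math.pi * k / L) for k in range(1, L + 1)]


def lifter_cepstrum(c, w):
    """Point-by-point product of cepstral coefficients c(1..L) with w(1..L)."""
    return [ci * wi for ci, wi in zip(c, w)]
```

With L = 20 the weight peaks at 11 for k = 10 and falls back toward 1 at k = 20, so the mid-order coefficients are emphasized relative to the lowest and highest ones.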

The RASTA (RelAtive SpecTrAl) process (Hermansky, et al., 1992) bandpass
filters each cepstral channel. In other words, each cepstral channel $c_i(n)$ is
convolved with the impulse response h(n). The system function used was

$$H(z) = \frac{0.1\,(2.0 + 1.0z^{-1} - 1.0z^{-3} - 2.0z^{-4})}{z^{-4}\,(1 - 0.98z^{-1})} \qquad (2)$$

It is important to note that Kao, et al. found no improvement in performance
using Hermansky's (1990) Perceptual Linear Predictive (with RASTA filtering)
features.

1.2  Baseline  Results 

We  chose  to  baseline  the  backpropagation  classifier  and  the  pervasively  used 
generalized  Lloyd's  vector  quantization  (VQ)  classifier  with  cepstral  front  end 
features  early  in  this  project  to  have  a  readily  available  comparative  base. 
Algorithms  that  do  not  quickly  meet  baseline  performance  can  rapidly  be 
discounted  without  having  to  thoroughly  test  the  algorithm. 

Unless  otherwise  stated,  the  front  end  features  used  in  all  tests  are  14th  order 
LPC  derived  cepstral  coefficients.  The  cepstral  coefficients  are  derived  from  the 
predictor coefficients by the following recursive procedure:

$$C_i = -LPC_i - \sum_{k=1}^{i-1} \frac{i-k}{i}\,C_{i-k}\,LPC_k, \quad i = 1, 2, \ldots, N \qquad (3)$$

where N is the LPC order. A 32 msec frame with a Hamming window at a 16
msec  frame  rate  was  used.  Only  voiced  frames  of  speech  were  used  for  both 
training  and  testing.  All  signal  processing  was  done  with  Entropic's 
ESPS/Waves  speech  processing  software. 

1.2.1 Backpropagation: The first set of tests was performed with the
backpropagation  (BP)  classifier,  with  training  on  the  first  16  speakers  of  the 
KING  database,  sessions  1-3,  and  testing  on  the  same  group  of  speakers  from 
sessions  4  and  5.  In  order  to  have  an  even  amount  of  training  data  from  each 
speaker,  we  found  the  speaker  from  a  given  session  with  the  least  number  of 
voiced  frames  of  speech  and  combined  that  number  of  voiced  frames  from  each 
session.  The  result  was  9.6  sec.  (3.2  sec.  from  each  session)  of  training  from  each 
speaker.  We  also  chose  3.2  sec.  of  testing  from  sessions  4  and  5.  In  all  cases  we 
allowed  the  network  to  build  evidence  on  a  frame-by-frame  basis  for  each  test 
file  before  reporting  results. 

In  the  first  test,  we  separated  the  speakers  into  four  groups  of  four  speakers 
each  and  trained  four  separate  BP  classifiers  with  each  group.  We  tested  the  BP's 
ability to correctly classify in-class speakers and its ability to reject out-of-class
speakers  with  thresholding.  The  recognition  rate  for  the  four  subnet  classifier, 
each  with  4  hidden  nodes,  was  40%  with  a  15.83%  false  alarm  rate. 

Next  a  BP  network  with  40  hidden  nodes  was  trained  on  all  16  speakers  and 
produced  a  recognition  rate  of  53.3%.  These  results  were  obtained  after 
approximately  100  epochs. 

The  final  BP  experiment  was  performed  by  training  the  same  40  hidden  node 
network  on  all  voiced  frames  of  sessions  1-3  and  tested  with  all  voiced  frames 
from  session  4  and  5  combined.  The  result  was  50%  recognition.  It  is  interesting 
to  examine  the  differences  between  this  experiment  and  the  previous  one  noted. 
When  the  network  was  trained  with  a  balanced  training  set  (9.6  sec. /speaker) 
the  correct  speaker  was  typically  in  second  or  third  place  when  the  network 
erred.  The  wrong  speakers  chosen  appeared  to  be  chosen  randomly.  However, 
when  using  all  the  data  in  the  sessions  for  training,  a  few  speakers  had  much 
more  training  data  than  others.  In  fact,  speaker  4  had  2-3  times  more  data  than 
50% of the other speakers. Under this condition, speaker 4 was chosen
erroneously in over 70% of the misclassifications, with the number 2 and 3 picks
ranked very nearly the same as if ranked by the amount of training data.

These  results  are  consistent  with  many  previous  observations  that  the 
backpropagation  network  is  computing  the  a-posteriori  probability  of  a  given 
event. That may be useful under certain conditions but it is detrimental in this
project  because  we  may  want  to  identify  an  individual  with  very  high  accuracy 
regardless  of  their  probability  of  occurrence. 

1.2.2  Real-Time  Recurrent  Learning:  We  made  an  initial  test  of  the  recurrent 
backpropagation network with the real-time recurrent learning (RTRL) algorithm.
A  full  description  of  the  algorithm  is  found  in  Williams  and  Zipser  (1989).  The 
initial  test  was  performed  to  see  if  the  RTRL  could  indeed  separate  speakers  from 
a  temporal  presentation  of  cepstral  coefficients.  The  speech  preprocessing 
performed  was  the  same  as  previously  described  in  this  section  except  that  we 
used  only  3.2  sec.  of  training  from  session  1  and  .5  sec  of  testing  from  session  2. 
The  experiment  was  only  for  three  speakers.  The  network  correctly  classified  all 
three  speakers  upon  testing.  With  this  result,  we  now  feel  more  comfortable 
performing  extensive  testing  of  the  RTRL  network.  However,  much  has  been 
written  about  the  excessive  training  times  required  of  recurrent  networks  and 
their  difficulties  in  scaling  up.  These  are  important  considerations  for  future 
work. 

1.2.3  VQ  Classifier:  In  a  series  of  experiments  using  VQ  we  attempted  to 
duplicate  some  of  the  results  of  Kao,  et  al.  since  these  are  the  best  results  reported 
to  date  using  the  KING  database.  This  implies  that  the  RASTA-liftered  cepstral 
coefficients  are  the  features  of  choice  for  speaker-identification  as  of  today.  The 
speech  preprocessing  was  the  same  as  already  discussed,  except  that  we  first 
scaled  the  data  by  16  to  get  close  to  the  maximum  dynamic  range  of  the  16  bit 
"short"  data  type  used  by  ESPS/Waves.  The  KING  database  was  quantized 
using  12  bits.  We  also  removed  the  dc  value  of  each  file  before  computing  the 
LPC  coefficients.  This  improved  the  performance  of  ESPS's  probability  of  voicing 
algorithm.  We  used  32  codeword  codebooks  for  each  speaker  with  the  mean- 
squared  error  distortion  measure.  The  following  tables  summarize  the  results. 


Training         Testing      Testing      Testing
sessions         session 4    session 5    sessions 4&5
1-3, all          81.25%       68.75%       87.50%
1-3, 9.6 sec.     62.50%       62.50%       *
1-3, 1 sec.       69.00%       50.00%       *

Table 2


The next set of experiments was conducted with cepstral liftering. Cepstral
liftering is simply the windowing of the cepstral coefficients with some
function. The results shown are from a Hamming window lifter. The triangular,
cosine-squared, and sine lifters discussed previously did not improve on the
results that follow.


Training         Testing       Testing       Testing
sessions         session 4     session 5     sessions 4&5

1-3, all         87.75%        75.00%        93.75%
1-3, 9.6 sec.    62.50%        62.50%
1-3, 1 sec.      69.00%        56.25%        *

Table 3
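
As a rough illustration of the Hamming window lifter used for Table 3, the
sketch below weights each cepstral coefficient by a Hamming window of the same
length. The window alignment, the handling of the zeroth coefficient, and the
function names are assumptions made for illustration; the report does not state
them.

```python
import math

def hamming_lifter(num_coeffs):
    """Return one Hamming-window weight per cepstral coefficient (assumed alignment)."""
    if num_coeffs == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_coeffs - 1))
            for n in range(num_coeffs)]

def lifter(cepstrum, weights):
    """Apply the lifter by weighting each cepstral coefficient."""
    return [c * w for c, w in zip(cepstrum, weights)]

# Example: lifter a flat 14th-order cepstral vector.
weights = hamming_lifter(14)
liftered = lifter([1.0] * 14, weights)
```

The window de-emphasizes the lowest and highest quefrency terms and leaves the
middle coefficients nearly unchanged.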


We  also  conducted  a  26  speaker  experiment  (the  San  Diego  group)  to 
compare  directly  with  Kao's  results.  The  most  comparable  results  using  the  14th 
order  cepstral  coefficients  were  obtained  with  a  40  codeword  codebook.  Kao 
used  the  20th  order  cepstrum  with  32  codewords  per  speaker.  The  following 
table  shows  these  results. 


Training         Testing       Testing       Testing
sessions         session 4     session 5     sessions 4&5

1-3, all         77.00%        69.00%        84.62%

Table 4


Our final experiment in this period examined short speech segment training and
testing using a Nellis Green Flag database recorded and provided by Rome
Laboratory. The database consists of tactical Green Flag exercise transmissions
from Nellis Air Force Base. For a nine-speaker set drawn from four different
platforms (including an air traffic controller), the VQ classifier using
liftered cepstral coefficients correctly identified all nine speakers with only
6 sec. of training and 3 sec. of testing. The actual amount of training and
testing data used was less than stated because, as noted previously, the
algorithm uses only voiced information.

1.3  Conclusion 

This  section  is  a  discussion  of  our  interpretation  of  the  recent  results 
discussed  in  this  report.  By  Rahim  and  Burr's  results,  we  know  that  with  stable, 
wideband  channels  with  high  signal-to-noise  ratios  spatial  representations 
(cepstral)  of  individuals'  speech  are  highly  separable.  This  results  in  very  high 
recognition  rates  with  as  many  as  50  speakers.  Under  the  ideal  signal  conditions 
used  by  Rahim  and  Burr,  simple  VQ  techniques  are  sufficient.  Indeed  they  can 
arguably  be  considered  preferable  to  more  sophisticated  neural  network 
techniques because VQ can be implemented in real time with modern computer
workstations  without  special  hardware. 

When signal conditions are less than ideal, as we find with telephone speech
data that has been recorded over a period of weeks or months with different
equipment, recognition rates are dramatically reduced. The recognition rates
can be significantly improved by liftering and RASTA filtering the cepstral
coefficients. The cepstral coefficients seem to be the representation of choice
to date. The cepstrum works well because it appears to separate channel
information, vocal tract information, glottal excitation, and external high
frequency harmonics along the quefrency axis. Perhaps this is why the lifter
and RASTA processes improve the recognition rates as shown here: these
processes attenuate channel information.

Given  our  results  on  BP  performance,  we  do  not  recommend  further  testing 
with  the  algorithm.  With  long  training  times  (approximately  5  hours) 
encountered  with  up  to  50  hidden  nodes,  it  makes  little  sense  to  look  for  a  larger 
network  that  can  equal  the  VQ  performance.  It  remains  to  be  seen  whether  other 
non-time-dependent  neural  networks  such  as  SOAP  or  other  more  costly 
classifiers  such  as  Kohonen's  learning  vector  quantizer  or  Gaussian  mixture 
models  can  significantly  improve  results.  We  also  feel  that  given  the 
complexities  of  the  RTRL  network,  this  network  must  significantly  improve 
results before it is considered as a viable alternative to VQ approaches.

2.0 Status Report 2

During the period of performance of the second status report, we investigated
and fully tested RASTA filtering of cepstral coefficients, RASTA PLP, RASTA
PLP with lateral inhibition, and their derivatives. We also first used codebook
fusion, or adjudication, during that period. We conducted further experiments
in short segment training and testing with the KING and Green Flag databases.
Finally, we investigated a process called lateral inhibition for noise
reduction.

2.1  RASTA  PLP  and  Lateral  Inhibition 

Notwithstanding the putative benefits of the auditory-like processing of the
PLP model (see Appendix 2), the PLP model lacks the important auditory
mechanism of masking. We add the masking mechanism by incorporating lateral
inhibition within the PLP process. Although the masking phenomenon is believed
to be caused by the mechanical properties of the basilar membrane, masking can
be effectively modeled with lateral inhibition. Although lateral inhibition
(LIN) does not itself cause masking, it is a common processing strategy (across
species) used in all sensory systems, including the auditory system (Knudsen,
1978; Shepherd, 1988). LIN performs several functions, of which contrast
enhancement and noise suppression are the most important for this
investigation.

We model the LIN process by convolving the negative of the second derivative of
the Gaussian with the spectrum of the signal. We use the following particular
LIN function implementation:

$$H(k) = \left[1.0 - (k/N)^{2}\right] \exp\left[-(k/N)^{2}/2.0\right], \quad k = 1, 2, \ldots, K \qquad (4)$$

where K is the number of frequency elements in the spectrum and N determines
the bandwidth of the function. The PLP and LIN processes were developed to
introduce some observed biological phenomena in the auditory-like model. In
keeping with this approach, care must be taken in increasing the bandwidth of
the LIN filters according to the critical band function, which is defined as
(Zwicker & Terhardt, 1980)

$$W_{c}(f) = 25 + 75\left(1 + 1.4 f^{2}\right)^{0.69} \qquad (5)$$

where the bandwidth is in Hz and f is in kHz. To incorporate LIN in the RASTA
PLP process, one simply convolves H(k) with the signal spectrum P(k)
[Appendix 2, (2)] prior to computing the critical band filter banks
[Appendix 2, (4)].
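
The LIN step can be sketched as follows: build the kernel of Eq. (4), convolve
it with a power spectrum, and evaluate the critical band function of Eq. (5).
This is a minimal illustration under stated assumptions. Mirroring the kernel
about k = 0, zero padding at the spectrum edges, and the parameter names
`half_width` and `n_bandwidth` are not taken from the report.

```python
import math

def lin_kernel(half_width, n_bandwidth):
    """Symmetric LIN kernel from Eq. (4): H(k) = [1 - (k/N)^2] exp[-(k/N)^2 / 2].

    Mirroring about k = 0 is an assumption; Eq. (4) lists only positive k.
    """
    kernel = []
    for k in range(-half_width, half_width + 1):
        u2 = (k / n_bandwidth) ** 2
        kernel.append((1.0 - u2) * math.exp(-u2 / 2.0))
    return kernel

def convolve_same(spectrum, kernel):
    """Convolve and keep the central part so the output matches the input length.

    Samples outside the spectrum are treated as zero (assumption).
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(spectrum)):
        acc = 0.0
        for j, h in enumerate(kernel):
            idx = i + half - j
            if 0 <= idx < len(spectrum):
                acc += h * spectrum[idx]
        out.append(acc)
    return out

def critical_bandwidth_hz(f_khz):
    """Critical bandwidth in Hz for a frequency in kHz, Eq. (5)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Example: a single spectral peak is kept and its neighbors are pushed negative.
kernel = lin_kernel(8, 2.0)
peak = [0.0] * 10 + [1.0] + [0.0] * 10
sharpened = convolve_same(peak, kernel)
```

The negative side lobes of the kernel are what suppress energy adjacent to a
strong spectral peak, which is the contrast enhancement described above.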

The  next  section  summarizes  the  results  obtained  from  experiments  using 
three  main  classes  of  acoustic  representations:  LPC  cepstrum,  RASTA  PLP,  and 
RASTA  PLP  with  LIN. 

2.2 KING Speaker Identification Test Results

The results reported in this section are based on the 26-speaker San Diego set
of the 8 kHz version of the KING database. The generalized Lloyd VQ algorithm
was used to design the classifiers for all experiments. A codebook of 40
codewords was used for each speaker. A 32 msec frame (256 samples) with a
Hamming window, advanced every 16 msec, was applied to all speech signals prior
to parameterization.
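
A minimal sketch of the VQ classifier described above: one codebook per speaker
trained with generalized Lloyd iterations, and a test utterance assigned to the
speaker whose codebook gives the lowest average squared-error distortion. The
random initialization, the fixed iteration count, and the toy two-dimensional
data are assumptions for illustration; the report used 40 codewords per speaker
on cepstral feature vectors.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

def train_codebook(vectors, size, iters=20, seed=0):
    """Generalized Lloyd iteration: assign vectors, then move codewords to cell means."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        cells = [[] for _ in range(size)]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def avg_distortion(vectors, codebook):
    """Mean squared error between each vector and its nearest codeword."""
    return sum(sq_dist(v, codebook[nearest(v, codebook)]) for v in vectors) / len(vectors)

def identify(test_vectors, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(test_vectors, codebooks[spk]))

# Toy example with two synthetic "speakers" in a 2-D feature space.
rng = random.Random(1)

def cloud(center, n):
    return [[c + rng.gauss(0.0, 1.0) for c in center] for _ in range(n)]

codebooks = {"A": train_codebook(cloud([0.0, 0.0], 200), 4),
             "B": train_codebook(cloud([5.0, 5.0], 200), 4)}
winner = identify(cloud([5.0, 5.0], 50), codebooks)
```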

We investigated three main classes of parameters and their derivatives. These
were the LPC cepstrum, the RASTA PLP, and the RASTA PLP with lateral
inhibition. In the LPC cepstrum class, the following parameters were computed:
the 14th order LPC cepstrum; the delta cepstrum, using a seven point linear
regression; the liftered cepstrum, using a Hamming window lifter; the RASTA
cepstrum; and the RASTA liftered cepstrum. The RASTA PLP class consisted of
the following parameters: the RASTA PLP cepstrum, the RASTA PLP delta
cepstrum, and the RASTA PLP acceleration cepstrum. These parameters were
computed by Reeves computer program. The RASTA PLP with LIN class was
similar to the RASTA PLP class but with lateral inhibition introduced as well.

We report on the combination of features that produced the best results for
each of the major parameter classes previously defined. The LPC order for the
LPC cepstrum class was 14, and the LPC order for both RASTA PLP classes was
12. In both cases, increasing the LPC order did not increase performance with
the 40-codeword codebooks for each speaker. "LR cepstrum" in Table 5 refers to
the cepstrum that has been both liftered and RASTA filtered.


                 LPC Cepstrum          RASTA PLP             RASTA PLP with LIN

Combination      liftered cepstrum,    cepstrum,             cepstrum,
                 delta cepstrum,       delta cepstrum,       delta cepstrum,
                 LR cepstrum           acceleration ceps.    acceleration ceps.

Speakers         10 and 16             16 and 17             3, 9, 17
Misclassified

% Correct        92.3%                 92.3%                 88.5%

Table 5


The results we obtained in these experiments do not agree with the results
recently reported by Kao and his associates (see last status report). Their
baseline result on the 26-speaker San Diego group using the LPC cepstrum was
81.73% correct. They reported that the RASTA PLP procedure did not improve on
this baseline. We obtained 88.5% correct with the RASTA PLP cepstrum alone. As
can be seen from the previous table, this performance can be improved by using
the delta and acceleration features as well. Kao also reported that RASTA
filtering of the cepstrum increased performance to 91.35% correct, and RASTA
filtering with liftering increased performance to 94.23%. Our results were
76.9% for RASTA filtering alone, and 73.1% for RASTA and liftering combined. We
cannot explain the differences between the two sets of results. Recent
communications with Kao have not shed any light on the problem.

When we first reviewed the results produced by the disparate parameters, we
found that each parameter space caused different classification errors, as
illustrated in Table 5. We found that the correct speaker, though improperly
identified in a given parameter space (for example, the liftered cepstrum),
might have a very small MSE ratio in that space; however, the incorrectly
chosen speaker in the liftered cepstrum space might have a very large MSE ratio
in, say, the RASTA cepstrum. When the results from the different parameter
spaces were combined, the overall scores improved. That is why seemingly poor
individual results, such as those from RASTA filtering, increased overall
performance when combined with others.

In addition to testing the parameters previously discussed, we were tasked by
the Rome Lab Speech Group to test LPC parameters alone and filter bank
representations of the spectrum. LPC features alone produced 46.2% correct
recognition, a 14th order uniform filter bank representation also produced
46.2% correct recognition, and a 24th order mel filter bank representation
produced 76.9% correct recognition. These results support the cepstrum (and its
variations) as the feature of choice for speaker identification.
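
The combination of scores across parameter spaces discussed earlier in this
section can be illustrated with a small sketch. The exact adjudication
procedure is defined in an earlier report and is not reproduced here, so the
rule below is a hypothetical stand-in: each feature space's per-speaker
distortions are divided by the smallest distortion in that space (an "MSE
ratio"), the ratios are summed over spaces, and the speaker with the smallest
total wins. The distortion values are made up and assumed to be positive.

```python
def mse_ratios(distortions):
    """Scale one feature space's per-speaker distortions by the smallest one."""
    best = min(distortions.values())
    return {spk: d / best for spk, d in distortions.items()}

def adjudicate(per_feature_distortions):
    """Sum the ratios over feature spaces and pick the speaker with the smallest total."""
    totals = {}
    for distortions in per_feature_distortions:
        for spk, ratio in mse_ratios(distortions).items():
            totals[spk] = totals.get(spk, 0.0) + ratio
    return min(totals, key=totals.get)

# Made-up distortions: speaker 10 narrowly loses in the liftered cepstrum space
# but wins clearly in the RASTA cepstrum space.
liftered = {"spk10": 1.05, "spk16": 1.00, "spk17": 1.60}
rasta = {"spk10": 2.00, "spk16": 3.40, "spk17": 2.90}
winner = adjudicate([liftered, rasta])
```

In this toy case the liftered cepstrum alone picks speaker 16, while the
combined score recovers speaker 10, which mirrors the behavior described in the
text.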

2.3  Short  Segment  Training  and  Testing  Results 

In  this  section,  we  describe  short  segment  training  and  testing  experiments 
performed  on  20  speakers  of  the  KING  database  and  17  speakers  of  the  Green 
Flag  database.  The  motivation  for  these  experiments  is  to  determine  whether 
mission-by-mission  on-line  tactical  training  and  tracking  of  speakers  is  possible. 

Our  conjecture  is  that  it  is  possible  and  the  results  reported  in  this  section 
support  that  conjecture. 

In order to reduce the possibility that we are tracking the strong background
acoustics of fighter aircraft rather than the speakers, we first performed
short segment experiments on the KING database. Although each session in that
database has different telephone channel conditions, we believe those channel
characteristics are well below the energy levels of the speech, and therefore
the chances that we are performing channel identification and not speaker
identification are reduced. Furthermore, by performing separate tests for each
of the sessions of the database we can be more confident that we are performing
speaker identification.

For the first experiment we chose the first 20 speakers to have a total of 7
seconds (5 for training and the next 2 for testing) of voiced data in session
1. These speakers were 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 18, 20,
21, 22, 24, and 26. Speakers 3, 13, 17, 19, and 25 did not meet this criterion.
For each of sessions 1-5, we trained separate codebooks for each of the 20
chosen speakers with 5 seconds of voiced data (liftered cepstrum and delta
cepstrum). We tested with the next 2 seconds of voiced data from within the
same transmission (data file). The cumulative results for all 5 sessions were
89.36% correct with liftered cepstrum only and 95.74% correct with liftered
cepstrum and delta cepstrum combined (through the adjudication procedure
previously described). These are the combined results of 94 individual tests.
It is interesting to note that not all speakers in each session had a full two
seconds of voiced data for testing. In session 2, speaker 2 had only 1.57 sec.,
speaker 9 had no test data available, speaker 16 had 0.13 sec., speaker 20 had
1.26 sec., and speaker 26 had only 0.1 sec. for testing. In session 3, speaker
5 had 1.34 sec., speaker 7 had no test data, speaker 9 had 0.75 sec., speaker
14 had 0.05 sec., and speaker 16 had 1.82 sec. of test data. In session 4,
speakers 11 and 12 had no test data, and in session 5, speaker 4 had 0.86 sec.,
speakers 9 and 24 had no test data, speaker 10 had 0.24 sec., speaker 16 had
0.21 sec., and speaker 22 had 0.24 sec. of test data available.

In the next experiment we reduced the amount of training data to 3.2 seconds
and the amount of testing data to 1.6 seconds. As in the previous experiment,
we used testing data from the same transmission as the training data. Only
speaker 24 from session 5 had no test data available. All other speakers in all
sessions had the 1.6 seconds of test data available. The combined result of 99
individual tests was 87.88% correct with liftered cepstrum only and 92.93% with
combined liftered cepstrum and delta cepstrum.

The next set of experiments sheds some light on the performance that is to be
expected on tactical AF communications. In the previous status report we
reported that we obtained 100% recognition in a nine-speaker experiment with
less than six seconds of actual training data and less than three seconds of
actual testing data. The amount of voiced information extracted from these
training and testing transmissions is less than 50% of the actual transmission
length. We performed another Green Flag database experiment in this reporting
period with 17 speakers. One to eight transmissions were available for the
experiment for each of the 17 speakers. We arbitrarily used the first two
transmissions for training and the rest of the transmissions for testing. Each
speaker had at least one and up to eight test transmissions. The longest
combined training transmission was 7.69 seconds for speaker CCZ, and the
shortest training transmission was 2.47 seconds for speaker CGA. The longest
test transmission was 3.81 seconds for speaker CFU, and the shortest test
transmission was 1.02 seconds. The combined result from 42 test transmissions
was 92.86% with the liftered cepstrum representation and 90.48% with the
liftered cepstrum and the delta cepstrum combined.

The results reported in this subsection indicate that mission-by-mission
on-line tactical training and tracking of speakers is possible with very high
accuracy. However, further testing with more databases is required to verify
this assertion.

3.0  Status  Report  3 

In this reporting period we conducted more short segment testing and training,
and began a preliminary analysis of methods for open set testing. We also added
the acceleration cepstrum as an additional feature in the LPC cepstrum group.
We trained a new codebook for this feature for all speakers in the San Diego
group, sessions 1-5, and combined the results of these new codebooks with the
previously computed results. We also conducted an "across the Great Divide"
experiment with the San Diego group. To satisfy Rome Lab's request for more
short segment training and testing, we conducted a 37-speaker Green Flag
experiment.

3.1  KING  Database  Experiments 

In this section we report on further work conducted with the KING database. As
shown in Table 5 in the previous status report, the result from a combination
of the liftered cepstra, delta cepstra, and the RASTA-liftered cepstra was
92.3% for the 26-speaker San Diego group, sessions 1-5. We added acceleration
cepstra features and adjudicated the results of codebooks trained with these
new features with the previously computed feature codebooks. Speaker 10, which
was previously misclassified, was now correctly identified with the added
feature, giving a new score of 96.2% correct (25 of 26 speakers). We computed
the acceleration cepstra by running the delta cepstra features through the same
linear regression algorithm used to compute the delta cepstra.
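
The delta and acceleration features can be sketched as a seven-point linear
regression slope applied once to the cepstra and then again to the resulting
deltas. This is a minimal illustration; repeating the first and last frames at
the utterance edges and the function name are assumptions, since the report
does not describe edge handling.

```python
def regression_delta(frames, half_span=3):
    """Least-squares slope over 2*half_span+1 frames, per coefficient."""
    denom = sum(k * k for k in range(-half_span, half_span + 1))  # 28 for seven points
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for c in range(len(frames[0])):
            num = 0.0
            for k in range(-half_span, half_span + 1):
                idx = min(max(t + k, 0), last)  # repeat the edge frames (assumption)
                num += k * frames[idx][c]
            row.append(num / denom)
        deltas.append(row)
    return deltas

# Example: the first coefficient rises by 2.0 per frame, the second is constant.
cepstra = [[2.0 * t, 1.0] for t in range(20)]
delta = regression_delta(cepstra)
accel = regression_delta(delta)
```

Away from the edges, the delta of the linearly rising coefficient is its slope
and the acceleration is zero, as expected for a regression-based derivative.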

We  also  conducted  an  across  the  Great  Divide  experiment  using  the  same  San 
Diego  group.  We  trained  on  sessions  1-3  and  tested  on  sessions  9  and  10.  We 
did  not  have  time  to  compute  the  RASTA  PLP  with  lateral  inhibition  features. 
Table 6 shows the best results of this experiment. LR cepstrum refers to the
RASTA and liftered cepstrum.

                      LPC Cepstrum                RASTA PLP

Combination           liftered cepstrum, delta    cepstrum, delta
                      cepstrum, acceleration      cepstrum, acceleration
                      cepstrum, LR cepstrum       cepstrum

Number of Speakers    19                          17
Misclassified

% Correct             26.9%                       34.6%

Table 6


The results we obtained in these experiments again do not agree with the
results recently reported by Kao and his associates (see last status report).
Kao et al. reported 77.88% correct recognition on the same group with RASTA and
liftering alone. The difference between their experiment and ours is that they
also included a run using sessions 6-8 for training and sessions 4 and 5 for
testing.

3.2  Green  Flag  Database  Results 

In this section we discuss the results of speaker identification tests we
conducted on an expanded group from the Green Flag database. In addition to the
17 speakers previously trained and tested, we added 20 speakers for a new
training and testing session. The 37 speakers were in 8 different aircraft
types or were ground controllers. Table 7 identifies the number of speakers
associated with each platform type.

As  in  the  previous  experiments,  we  trained  on  the  first  two  transmissions  and 
tested  on  all  subsequent  available  transmissions  from  each  speaker.  The 
following  tables  describe  the  training  and  testing  transmission  length  statistics 
for  this  data  set.  The  speaker  identification  performance  is  shown  in  Table  12. 

Note that RASTA filtering degraded overall performance. This result is
consistent with the putative channel normalizing effects of RASTA filtering.
Since we assume the acoustic background in the transmissions is helping to
separate the different speakers, one would expect that channel normalization
would degrade performance.

Speakers in Platforms

Number of Speakers    Platform Type
1                     C130
9                     F15
2                     RF4C
11                    F16
4                     F4G
1                     EA6
1                     F117
4                     A10
4                     Towers

Table 7


Training Transmission Length Statistics For 37 Transmissions

stat     seconds
max      7.82
min      1.89
mean     4.37

Table 8


Training Transmission Lengths

Transmission Length (tl)    Number of Transmissions
1.0 < tl < 2.0              1
2.0 < tl < 3.0              4
3.0 < tl < 4.0              5
4.0 < tl < 5.0              8
5.0 < tl < 6.0              4
6.0 < tl < 7.0              0
7.0 < tl < 8.0              5
Total                       37

Table 9


Test Transmission Length Statistics For 37 Transmissions

stat     seconds
max      6.75
min      0.49
mean     2.15

Table 10


Test Transmission Lengths

Transmission Length (tl)    Number of Transmissions
0.0 < tl < 1.0              6
1.0 < tl < 2.0              56
2.0 < tl < 3.0              29
3.0 < tl < 4.0              13
4.0 < tl < 5.0              4
5.0 < tl < 6.0              2
6.0 < tl < 7.0              1
Total                       111

Table 11


Results

Feature                            Correct
liftered cepstrum                  76.58%
delta cepstrum                     58.56%
acceleration cepstrum              56.76%
RASTA                              37.84%
cepstrum & delta                   82.88%
cepstrum, delta, & acceleration    83.78%
cepstrum, delta, & RASTA           81.08%

Table 12

3.3 Approaches to the Open Set Problem

In this section we review two possible approaches to the tactical
communications open set speaker identification problem. The concept of
operations assumed is on-line training of desired speakers and the subsequent
tracking of these speakers on a mission-by-mission basis. Given this concept of
operations, techniques such as the Nearest Neighbor approach used in ITT's
speaker identification system are not possible. We also do not believe that
building out-of-class models based on previously recorded data is the
appropriate approach to investigate. There are two main reasons for this. The
first has to do with channel considerations. Since the channel changes
considerably from day to day, there is no way of determining whether the
generated out-of-class set is meaningful given a new day's channel conditions.
Second, there is no way of knowing whether the speakers chosen for the
out-of-class set will more closely match the real out-of-class set or the
trained speaker set.

Ideally, one would want to generate rejection criteria based on the training
data. This may be accomplished by the following approach, which is very similar
to the Gaussian Mixture Model (GMM) approach used by MIT Lincoln Lab. The first
step would be to generate VQ codebooks as before using the generalized Lloyd
algorithm. Each codeword in each speaker's codebook is now the mean of a
multivariate Gaussian. A diagonal covariance matrix can then be computed by
passing the training data through each codebook one final time. Boundaries for
each multivariate Gaussian can then be determined by each Gaussian's variance.
For example, if a data vector falls beyond the boundaries determined by some
5C, where C is the (diagonal) covariance matrix of each Gaussian, then that
data vector is considered to be outside the desired class. If more data vectors
of a particular transmission fall outside those boundaries than not, then the
entire transmission is labeled as unknown or novel.

More precisely, we want to determine whether an input vector $\mathbf{x}$
belongs to class (speaker) $s_{i}$, where $i$ refers to the $i$th speaker. We
do this by maximizing the likelihood $P(\mathbf{x} \mid s_{i})$. We assume that
each speaker is modeled by a mixture of $n$ multivariate Gaussians. The mean
vectors and diagonal covariance matrices of the Gaussians are determined by the
VQ process previously described. Therefore, each speaker model is described as

$$P_{j}(\mathbf{x} \mid s_{i}) = \frac{1}{(2\pi)^{N/2} (\det C_{i,j})^{1/2}} \exp\left[-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_{i,j})^{T} C_{i,j}^{-1} (\mathbf{x}-\boldsymbol{\mu}_{i,j})\right] \qquad (6)$$

where $N$ is the dimension of $\mathbf{x}$; and $\boldsymbol{\mu}_{i,j}$ and
$C_{i,j}$ are the mean vector and diagonal covariance matrix of the $j$th model
Gaussian for speaker $s_{i}$, respectively. In order to reduce the computations
required in (6), we can take the natural log of (6), remove the minus sign
within the exponential, and remove the constants in the denominator. The
distance measure becomes

$$d(\mathbf{x}, \boldsymbol{\mu}_{i,j}) = (\mathbf{x}-\boldsymbol{\mu}_{i,j})^{T} C_{i,j}^{-1} (\mathbf{x}-\boldsymbol{\mu}_{i,j}) + \ln(\det C_{i,j}) \qquad (7)$$

Since the exponential's minus sign has been removed, the speaker's model that
minimizes equation (7) wins.

We need to assign the diagonal of $C_{i,j}$ to a vector in order to set the
class boundaries as previously discussed. We define the boundary as

Finally, the decision rule becomes

$$X(s^{*}) = \begin{cases} \arg\min_{i,j} d(\mathbf{x}, \boldsymbol{\mu}_{i,j}) & \text{if } \mathbf{x} \text{ lies within the boundary} \\ \infty & \text{otherwise} \end{cases}$$

If $X(s^{*})$ scores mostly $\infty$ for all vectors in a transmission, then
the test speaker is labeled as an unknown speaker.
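
The rejection scheme can be sketched as follows. Variances are estimated per
codeword from one pass of the training data, Eq. (7) is evaluated for a
diagonal covariance, and a transmission is labeled unknown when most of its
vectors fall outside every Gaussian's boundary. The boundary test used here
(each component within `width` standard deviations of the mean) is a
hypothetical choice, as are the variance floor and the toy data; the report's
own boundary definition is not reproduced.

```python
import math

def diag_variances(vectors, means, floor=1e-6):
    """Per-codeword diagonal variances from one final pass of the training data."""
    dims = len(means[0])
    sums = [[0.0] * dims for _ in means]
    counts = [0] * len(means)
    for v in vectors:
        j = min(range(len(means)),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(v, means[m])))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += (v[d] - means[j][d]) ** 2
    return [[max(s / max(counts[j], 1), floor) for s in sums[j]]
            for j in range(len(means))]

def distance(x, mean, var):
    """Eq. (7) for a diagonal covariance: quadratic term plus ln(det C)."""
    return sum((a - m) ** 2 / v + math.log(v) for a, m, v in zip(x, mean, var))

def inside(x, mean, var, width):
    """Hypothetical boundary: every component within `width` standard deviations."""
    return all(abs(a - m) <= width * math.sqrt(v) for a, m, v in zip(x, mean, var))

def score(x, means, variances, width):
    """Smallest Eq. (7) distance over Gaussians whose boundary contains x, else inf."""
    dists = [distance(x, m, v) for m, v in zip(means, variances)
             if inside(x, m, v, width)]
    return min(dists) if dists else math.inf

def is_unknown(transmission, means, variances, width=3.0):
    """Label the transmission unknown when most of its vectors score infinity."""
    rejected = sum(1 for x in transmission
                   if math.isinf(score(x, means, variances, width)))
    return rejected > len(transmission) / 2

# Toy speaker model with two codewords and a handful of training vectors.
means = [[0.0, 0.0], [4.0, 4.0]]
train = [[-0.5, 0.0], [0.5, 0.0], [0.0, -0.5], [0.0, 0.5],
         [3.5, 4.0], [4.5, 4.0], [4.0, 3.5], [4.0, 4.5]]
variances = diag_variances(train, means)
known = [[0.2, -0.1], [3.9, 4.3], [0.1, 0.4]]
novel = [[10.0, 10.0], [9.0, -3.0], [0.1, 0.1]]
```

With this toy model, `is_unknown(known, means, variances)` accepts the first
transmission and `is_unknown(novel, means, variances)` flags the second, since
two of its three vectors lie outside both boundaries.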

The other approach worth considering is to adapt the cohort method for
determining output score thresholds. Once the training data set of desired
speakers has been gathered, the out-of-class thresholds can be determined. This
is accomplished by computing the average score for each speaker's model using
the other speakers' training data; we will call these speakers the background
set. In other words

$$d(\mathbf{x}, \bar{\boldsymbol{\mu}}_{i,j}) = \frac{1}{B} \sum_{k=1}^{B} d(\mathbf{x}_{k}, \boldsymbol{\mu}_{i,j}) \qquad (10)$$

where $B$ is the number of background speakers. The decision rule now becomes

$$\Theta(s^{*}) = d(\mathbf{x}, \bar{\boldsymbol{\mu}}_{i,j}) - d(\mathbf{x}, \boldsymbol{\mu}_{i,j}) \qquad (11)$$

$$\Theta(s^{*}) > \Delta \Rightarrow s^{*} \text{ detected}$$
$$\Theta(s^{*}) < \Delta \Rightarrow \text{unknown speaker} \qquad (12)$$
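
A minimal sketch of the cohort-style decision follows, under one reading of
Eq. (11): the score is the average distance over the background set minus the
distance to the best-matching speaker model, and the speaker is accepted only
when that margin exceeds a threshold. The function names, the threshold value,
and the distances are hypothetical.

```python
def cohort_score(claimed_distance, background_distances):
    """Average background distance minus the claimed model's distance (cf. Eq. 11)."""
    background_mean = sum(background_distances) / len(background_distances)
    return background_mean - claimed_distance

def decide(claimed_distance, background_distances, threshold):
    """Accept the best-matching speaker only if the margin exceeds the threshold."""
    if cohort_score(claimed_distance, background_distances) > threshold:
        return "detected"
    return "unknown"

# Made-up distances: the first test vector is far closer to the claimed model
# than to the background set, the second is not.
accepted = decide(2.0, [9.0, 11.0, 10.0], threshold=3.0)   # margin 8.0
rejected = decide(8.5, [9.0, 11.0, 10.0], threshold=3.0)   # margin 1.5
```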

3.4 Conclusion

During  this  reporting  period  we  added  another  LPC  cepstrum  derived 
feature,  the  acceleration  cepstrum,  and  found  that  its  addition  improved  overall 
performance.  We  also  conducted  an  expanded  Greenflag  database  experiment 
with  37  speakers  and  obtained  over  83%  recognition  rate  with  111  test 
transmissions.  Finally  we  conducted  an  initial  analysis  of  two  approaches  to  the 
open  set  problem. 

4.0  Status  Report  4 

During the reporting period of this status report, we concentrated on
conducting narrowband experiments with both the KING and GREENFLAG databases.
In the Status Report dated May 24, 1993, we reported the results of short
segment experiments we conducted with the narrow band portion of the KING
database. We reran those experiments with the wideband data to try to minimize
the effects of the channel. In addition, we made a modification to the early
stages of the cepstrum and delta cepstrum algorithm that has improved overall
performance.

4.2  Narrowband  Experiments 

In this section we report on the results of bandpass filtering all speech data
to a 500 - 2500 Hz band. The motivation for this was to use the part of the
speech waveform that carries the most energy and to remove everything else. Any
channel information that dominates the speech, especially at the higher
frequencies, will be attenuated or removed. The details of the bandpass filter
design used are shown in Section 5.3.

As in previous KING database experiments, we computed the liftered cepstrum,
the delta cepstrum (with a seven-point linear regression), the acceleration
cepstrum, and the RASTA cepstrum of the 26-speaker San Diego group. This time,
however, we bandpass filtered the 8 kHz sampled waveforms prior to
parameterization.
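
For illustration only, a 500 - 2500 Hz bandpass filter for 8 kHz data can be
sketched as a windowed-sinc FIR design. This is not the filter of Section 5.3,
which is not reproduced here; the design method, the 101-tap length, and the
function names are assumptions.

```python
import math

def bandpass_fir(low_hz, high_hz, fs_hz, num_taps=101):
    """Windowed-sinc bandpass: difference of two ideal low-pass responses, Hamming-tapered."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (high_hz - low_hz) / fs_hz
        else:
            ideal = (math.sin(2.0 * math.pi * high_hz * t / fs_hz)
                     - math.sin(2.0 * math.pi * low_hz * t / fs_hz)) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def gain(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at one frequency."""
    re = sum(h * math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n, h in enumerate(taps))
    im = sum(h * math.sin(2.0 * math.pi * freq_hz * n / fs_hz) for n, h in enumerate(taps))
    return math.hypot(re, im)

def fir_filter(signal, taps):
    """Direct-form FIR filtering with zero initial state."""
    return [sum(h * signal[i - k] for k, h in enumerate(taps) if i - k >= 0)
            for i in range(len(signal))]

taps = bandpass_fir(500.0, 2500.0, 8000.0)
```

Evaluating `gain(taps, f, 8000.0)` at a few frequencies is a quick way to check
that energy inside the band passes nearly unchanged while components near dc
and near the Nyquist frequency are strongly attenuated.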

Table 13 shows the results of most of the combinations of features. The
results shown were obtained through feature adjudication as explained in
previous reports. The features are numbered as follows: 1) liftered
cepstrum, 2) delta cepstrum, 3) acceleration cepstrum, and 4) RASTA cepstrum.

Features     Number of Speakers    Percent Correct
             Misclassified
1            2                     92.3%
2            16                    38.5%
3            11                    57.7%
4            5                     80.8%
1&2          3                     88.5%
1&3          3                     88.5%
1&4          3                     88.5%
1,2&3        3                     88.5%
1,2&4        4                     84.6%
1,3&4        4                     84.6%
2&4          5                     80.8%
3&4          6                     76.9%
1,2,3,&4     4                     84.6%

Table 13
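
As a quick arithmetic check, the percentages in Table 13 follow directly from
the number of misclassified speakers out of the 26 in the San Diego group. The
snippet below recomputes a few rows; the function and variable names are
illustrative only.

```python
def percent_correct(num_speakers, num_misclassified):
    """Percentage of speakers identified correctly."""
    return 100.0 * (num_speakers - num_misclassified) / num_speakers

# Misclassified counts for a few feature sets, copied from Table 13.
table_13 = {"1": 2, "2": 16, "3": 11, "4": 5, "1&2": 3, "3&4": 6}
rates = {features: round(percent_correct(26, missed), 1)
         for features, missed in table_13.items()}
```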


It is evident from Table 13 that bandpass filtering improved the separability
of the liftered cepstrum only. The new score obtained from the liftered
cepstrum feature was 92.3%, as opposed to 88.5% recognition without bandpass
filtering. However, without bandpass filtering, the inclusion of the other
three parameters increased the overall performance to 96.2%. This was not the
case with bandpass filtering; the inclusion of the other features reduced
overall performance.

Therefore,  what  we  gained  by  bandpass  filtering  in  one  parameter  space  -  the 
liftered  cepstrum  -  we  lost  in  the  other  parameter  spaces.  One  can  conclude  that 
the  derivative  and  RASTA  filtering  operations  have  a  much  greater  effect  on  the 
frequency  channels  outside  of  the  500-2500  Hz  band.  Since  the  cepstrum  gives 
information  relating  to  the  harmonics  of  the  signal  and  there  is  more  energy  in 
the  speech  signal  in  the  low  end  of  the  spectrum  than  the  high  end,  we  posit  that 
the  derivative  and  RASTA  processes  have  their  greatest  influence  on  the  low  end 
of  the  spectrum  -  below  500  Hz.  More  work  is  needed  to  verify  this  assumption. 

We  also  performed  narrow  bandpass  experiments  with  the  GREENFLAG 
database. This was an important exercise since these experiments would shed
light on how important the background audio is in identifying such short
segment  transmissions. 

We found that bandlimiting the transmissions of the GREENFLAG database
produced an overall decrease in performance. The best performance previously
reported with a 37-speaker closed set and 111 test transmissions was
approximately 84% correct. Upon bandlimiting the data with the filter described
in Section 5.3, the overall performance dropped to approximately 75%.

This is an important finding: it verifies that the audio background is a
significant component in these tests. The new performance is close to what was
reported by Lincoln Laboratories in an experiment to determine the effects of the
background on overall performance. In that experiment the background was
deconvolved from the speech waveform, and the subsequent speaker identification
performance was similar to what we obtained by bandpass filtering. These
findings suggest that the mean training length of four seconds (of trained radio
communicators' speech) is close to what is required for good performance. In
other words, since we achieve approximately 75% correct recognition after
reducing the effects of the background audio and channel, we should be near the
knee of the performance versus training time curve.

With such short training utterances, the audio background or "context" is
important additional information for speaker identification. The flip side of
this finding is that if a speaker's subsequent transmission has
characteristically different audio background statistics, a mismatch is much more
likely. Of course, this is not a surprising result. When channel conditions are
known to change significantly, the need to train with many examples of a
speaker's speech in the different channel conditions is well understood. A very
important distinction must be pointed out, however: large amounts of training
data are not required to adequately model the speaker; they are required to model
the changes in the acoustic parameters induced by varying channel conditions.

4.2  Upcoming  Work 

This status report concludes our requirements for all work under the original
contract except for a final report. We will deliver the final report at the end of
the extension of this program. During the extension period, we will investigate
techniques for dealing with out-of-set speakers. In the previous status report, we
outlined two potential approaches. To start, we will investigate a simpler
approach: find the average of the lowest mean square error scores of the N
closest speakers to each speaker in the database and set a threshold
proportional to that score. This is a good first step since the overall algorithm
will not change with this approach. If this approach is not sufficient, then we
will investigate the methods explained in the previous status report.
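One possible reading of the thresholding rule above is sketched here in Python. It assumes a hypothetical matrix mse[i][j] holding the mean square error of speaker i's training data quantized with speaker j's codebook, and hypothetical parameters n_closest and scale. None of these names come from the report, and the exact definition of the score may differ.

```python
def speaker_thresholds(mse, n_closest=3, scale=0.8):
    """Per-speaker rejection thresholds from the N closest impostor scores.

    mse[i][j] is assumed to be the mean square error of speaker i's
    training data quantized with speaker j's codebook (hypothetical layout).
    At least two speakers are required.
    """
    thresholds = []
    for j in range(len(mse)):
        # Distortions of every *other* speaker against codebook j, smallest first.
        impostor = sorted(mse[i][j] for i in range(len(mse)) if i != j)
        closest = impostor[:n_closest]
        # Threshold is proportional to the average of the N closest scores.
        thresholds.append(scale * sum(closest) / len(closest))
    return thresholds


def accept(score, speaker, thresholds):
    """Accept the closed-set winner only if its distortion is below threshold."""
    return score < thresholds[speaker]
```

Because the threshold is applied after the closed-set decision, the classifier itself is unchanged; only a final accept or reject step is added.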

4.3  Bandpass  Filter  Design

This appendix describes the design of the bandpass filter introduced in
Section 2. The filter was implemented using Entropic's ESPS filter routine
iir_filt. The following is a copy of the parameter file used to design the
filter. The variables pass_band_loss and stop_band_loss are defined in dB. The
variables s_freq1 and s_freq2 are the stopband frequencies, and p_freq1 and
p_freq2 define the passband frequencies. The other variables are
self-explanatory.

float samp_freq = 8000;
float gain = 1.0;
string filt_method = "BUTTERWORTH";
string filt_type = "BP";
float pass_band_loss = 3;
float stop_band_loss = 7;
float s_freq1 = 400;
float p_freq1 = 500;
float p_freq2 = 2500;
float s_freq2 = 2600;
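As a cross-check on the parameter file, the following Python sketch applies the textbook order estimate for a bilinear-transform Butterworth bandpass design to the values above. This is the standard formula, not necessarily the computation iir_filt performs internally; the function name is hypothetical.

```python
import math


def butterworth_bandpass_order(fs, p1, p2, s1, s2, pass_loss_db, stop_loss_db):
    """Minimum order of the low-pass prototype for a digital Butterworth
    bandpass filter designed with the bilinear transform (textbook formula)."""

    def warp(f):
        # Pre-warped analog frequency for the bilinear transform.
        return math.tan(math.pi * f / fs)

    wp1, wp2, ws1, ws2 = warp(p1), warp(p2), warp(s1), warp(s2)

    def to_lowpass(w):
        # Map a bandpass stopband edge onto the normalized low-pass axis.
        return abs((w * w - wp1 * wp2) / (w * (wp2 - wp1)))

    ws = min(to_lowpass(ws1), to_lowpass(ws2))
    ratio = (10 ** (stop_loss_db / 10) - 1) / (10 ** (pass_loss_db / 10) - 1)
    return math.ceil(math.log10(ratio) / (2 * math.log10(ws)))


# Values taken from the parameter file above.
order = butterworth_bandpass_order(8000, 500, 2500, 400, 2600, 3, 7)
print(order, 2 * order + 1)
```

A bandpass filter has twice the order of its low-pass prototype, so a prototype of order N gives numerator and denominator polynomials with 2N + 1 coefficients each, which can be compared with the num_size and denom_size values reported for this filter.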

The  following  are  the  filter  polynomial  coefficients  and  their  respective  zero- 
pole  values.  These  were  produced  from  the  ESPS  routine  filtspec.  The  zeros  and 
poles  are  listed  as  complex  valued  numbers  in  the  form  [real  part,  imaginary  part]. 

Record 1: num_size: 15, denom_size: 15, zero_dim: 14, pole_dim: 8

re_num_coeff:
 0:  0.016587  0  -0.11611  0  0.34833
 5:  0  -0.58055  0  0.58055  0
10:  -0.34833  0  0.11611  0  -0.016587

re_denom_coeff:
 0:  1  -3.7875  6.4156  -7.2633  7.4114
 5:  -6.9468  5.1075  -2.9843  1.6353  -0.80399
10:  0.28173  -0.073536  0.021143  -0.0041528  -1.447e-06

zeros:
 0:  [ -1, 0]  [ 1, 0]
 2:  [ -1, 0]  [ 1, 0]
 4:  [ -1, 0]  [ 1, 0]
 6:  [ -1, 0]  [ 1, 0]
 8:  [ -1, 0]  [ 1, 0]
10:  [ -1, 0]  [ 1, 0]
12:  [ -1, 0]  [ 1, 0]

poles:
 0:  [ -0.32454, 0.78778]  [ 0.86557, 0.35619]
 2:  [ -0.21493, 0.55368]  [ 0.75601, 0.29306]
 4:  [ -0.10124, 0.32377]  [ 0.64233, 0.20025]
 6:  [ -0.00034782, 0]  [ 0.54144, 0]
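The listing can be checked for internal consistency with a short Python sketch. It assumes that each of the six complex poles listed stands for a conjugate pair (which would give the 14 poles implied by a 15-coefficient denominator), then expands the poles into a polynomial in z^-1 and compares it with re_denom_coeff. The helper names are hypothetical.

```python
# Poles as listed above: [real part, imaginary part].
poles_listed = [
    complex(-0.32454, 0.78778), complex(0.86557, 0.35619),
    complex(-0.21493, 0.55368), complex(0.75601, 0.29306),
    complex(-0.10124, 0.32377), complex(0.64233, 0.20025),
    complex(-0.00034782, 0.0), complex(0.54144, 0.0),
]


def full_pole_set(listed):
    """Add the conjugate of every complex pole (assumption: only one pole
    of each conjugate pair is listed); real poles appear once."""
    poles = []
    for p in listed:
        poles.append(p)
        if p.imag != 0.0:
            poles.append(p.conjugate())
    return poles


def poly_from_roots(roots):
    """Coefficients of prod(1 - r * z^-1), in ascending powers of z^-1."""
    coeffs = [1.0 + 0.0j]
    for r in roots:
        shifted = [0.0j] + [-r * c for c in coeffs]
        coeffs = [a + b for a, b in zip(coeffs + [0.0j], shifted)]
    return [c.real for c in coeffs]


poles = full_pole_set(poles_listed)
denom = poly_from_roots(poles)
# A causal IIR filter is stable when every pole lies inside the unit circle.
stable = all(abs(p) < 1.0 for p in poles)
print(len(poles), stable, denom[1], denom[-1])
```

Under that assumption, the second and last expanded coefficients can be compared directly with the -3.7875 and -1.447e-06 entries of re_denom_coeff; small differences are expected because the poles are printed to five significant digits.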

4.4  Improvements To Speaker Identification Algorithms Under Booz·Allen & Hamilton Investment

Booz·Allen is independently interested in speaker identification. Under
Booz·Allen funding, the Principal Investigator worked for approximately one
month to reduce the computation time of the speaker identification algorithms.
In addition, an X Window demonstration and interface were developed. These
changes are described in this section.

A change in the front-end processing stage has produced an unexpected
improvement in overall speaker identification performance. This occurred while
we were trying to decrease overall processing time without degrading
identification performance.

In the past, every frame in the transmission was processed (e.g., the cepstra,
delta cepstra, and RASTA cepstra were computed for each frame). Afterwards, a
voiced/unvoiced decision was made for each frame, and only the voiced frames of
each parameter were kept for either training or testing. This appeared to be a
tremendous waste of processing since most of the frames in an interval of natural
speech are silence (or background). However, since the derivatives and RASTA
filtering of the cepstral parameters are required in the process, it did not make
sense to extract pieces of the transmissions and concatenate disjoint pieces of
the communication.

Notwithstanding the intuitive reasons for not removing the unvoiced sections
of speech prior to further processing, we did it anyway and tested how it
affected overall performance. We were surprised to find that this method
actually improved performance, in some cases significantly.
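The difference between the two processing orders can be illustrated with a minimal Python sketch. It assumes a simple regression-based delta with repeated edge frames and a precomputed list of voiced flags; the function names and the delta definition are illustrative, not taken from the report's C programs.

```python
def delta_features(frames, half_width=2):
    """Regression delta over a list of feature vectors.

    Frames beyond either end are replaced by the nearest edge frame
    (an assumption made for this sketch).
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, half_width + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            vec.append(num / denom)
        out.append(vec)
    return out


def derive_then_select(ceps, voiced, half_width=2):
    """Earlier order: delta over every frame, then keep the voiced frames."""
    d = delta_features(ceps, half_width)
    return [d[t] for t in range(len(ceps)) if voiced[t]]


def select_then_derive(ceps, voiced, half_width=2):
    """Faster order: drop unvoiced frames first, then compute the delta."""
    kept = [c for c, v in zip(ceps, voiced) if v]
    return delta_features(kept, half_width)
```

Both functions return one vector per voiced frame, but they differ wherever the regression window straddles a voiced/unvoiced boundary: the first mixes background frames into the derivative, while the second differentiates across concatenated voiced segments.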

With the new and faster algorithm we achieved 100% recognition of the San
Diego group in the KING database. However, the combination of features was
different. The best performance prior to the algorithm change was 96.2%
recognition with four features: liftered cepstra, delta cepstra, acceleration
cepstra, and RASTA-liftered cepstra. With the new algorithm, 100% recognition
was achieved with the RASTA-liftered cepstra and the delta cepstra.

Even more significant results were obtained when we retested the
GREENFLAG database. The best performance previously obtained was 83.78%
recognition. With the faster algorithm we obtained nearly 95% recognition. In
addition, Rome Laboratory provided us with additional speakers and
transmissions, increasing the database to 41 speakers and a total of 159 test
transmissions. We obtained nearly 94% recognition performance with the
updated database in a new closed set test.

It is not entirely clear why the change to the algorithm previously described
improves the overall recognition performance. What we know, however, is that
the transition boundaries between the voiced and unvoiced sections of speech
hamper the recognition process if included. We also know that the liftered
cepstral features are identical in either process. What changes, of course, are the
boundary areas in the delta, acceleration, and RASTA features. It appears that
vocal tract transitions from voiced to voiced areas of speech are the important
pieces of information vis-a-vis the derivative and filtering operations.

In addition to the algorithm change described above, we implemented the
entire algorithm in C. The ESPS/Waves license is no longer needed to run any of
the speaker recognition functions developed to date. This has greatly decreased
the computation time; however, we are not yet able to achieve real-time
performance. The VQ training and testing portions of the algorithm require the
most time. We are implementing these algorithms on the Adaptive Solutions
CNAPS SIMD parallel processor in order to achieve real-time training and
testing. With this capability, we will be able to conduct many more tests in
the upcoming program extension than we are now capable of performing.

Appendix  9


Software  User's  Manual 


Parameter  Computation 

All parameters are computed using four major C programs. These include:

a) io_overlap.c

b) get_ceps.c

c) get_dceps.c

d) get_rasta.c

The io_overlap.c program segments the sampled transmissions into overlapped
frames of speech. The get_ceps.c, get_dceps.c, and get_rasta.c programs
compute the cepstrum, delta cepstrum, and RASTA cepstrum respectively.

At the C shell prompt, the programs are called as follows:

cat <file> | io_overlap <window width, in samples> \
<amount of window slide, in samples> | \
get_ceps <window width> <lpc order> <power_threshold> <window?> > <cepsfile>

The argument <file> is the name of a raw binary input transmission file. If the
input file has an Entropic header, then replace the cat <file> command with the
following ESPS command: bhd <file.sd>. The bhd command strips the header
from the input file. The last argument to the get_ceps program, <window?>, is a
flag for liftering the cepstral coefficients; use 1 if you want the liftered cepstrum,
0 if you want the plain cepstrum. All other arguments are self-explanatory.
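The segmentation step performed ahead of get_ceps can be pictured with the following Python sketch. It assumes that a trailing partial frame is simply dropped, which the manual does not state; the function name is hypothetical and is not the io_overlap source.

```python
def overlapped_frames(samples, width, slide):
    """Split a sample sequence into frames of `width` samples, advancing
    `slide` samples each time. A trailing partial frame is dropped
    (an assumption of this sketch)."""
    if width <= 0 or slide <= 0:
        raise ValueError("width and slide must be positive")
    return [samples[start:start + width]
            for start in range(0, len(samples) - width + 1, slide)]
```

For example, a window width of 4 samples with a slide of 2 samples gives frames that overlap by half their length.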

To compute the delta cepstrum, call get_dceps as follows:

cat <ceps file> | get_dceps <lpc order> <buffer size> <derivative points> > \
<delta ceps file>

The argument <buffer size> refers to the number of cepstral vectors to read
from the input stream and <derivative points> is the number of points to use for
computing the regression. The call to the RASTA filtering program is similar:

cat <ceps file> | get_rasta <lpc order> <buffer size> > <rasta ceps file>

Training

Once  all  the  parameters  have  been  obtained,  train  the  codebooks  for  each 
parameter  as  follows: 

lbg_vq <codebook file name> <parameter file> <lpc order> <codebook size> \
<converge threshold> <max iterations> <split criteria>

The argument <converge threshold> refers to the average codebook MSE
change from the previous LBG iteration (see Appendix 4) and <max iterations> is
the maximum number of LBG iterations allowed after each codeword split. The
argument <split criteria> refers to the criterion for splitting individual codewords
after binary codebook splitting is no longer possible. Use 0 for the density
criterion and 1 for the MSE criterion.
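To show how the convergence threshold and iteration limit interact, here is a minimal Python sketch of LBG training with binary splitting. It covers only codebook sizes that are powers of two and omits the density and MSE criteria used once binary splitting is no longer possible. The function names and the perturbation constant eps are assumptions of this sketch, not the lbg_vq implementation.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg_train(data, codebook_size, converge_threshold=1e-3,
              max_iterations=20, eps=0.01):
    """LBG with binary splitting; codebook_size is assumed a power of two."""
    codebook = [_mean(data)]
    while len(codebook) < codebook_size:
        # Split every codeword into a slightly perturbed pair.
        codebook = [[c + s * eps for c in cw]
                    for cw in codebook for s in (+1, -1)]
        previous = float("inf")
        for _ in range(max_iterations):
            cells = [[] for _ in codebook]
            total = 0.0
            for vec in data:
                k = min(range(len(codebook)),
                        key=lambda i: _dist2(vec, codebook[i]))
                cells[k].append(vec)
                total += _dist2(vec, codebook[k])
            mse = total / len(data)
            # Move each codeword to the centroid of its cell;
            # an empty cell keeps its old codeword.
            codebook = [_mean(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
            # Stop when the average MSE change falls below the threshold.
            if previous - mse < converge_threshold:
                break
            previous = mse
    return codebook
```

After each split the inner loop runs until the average MSE stops improving by more than the threshold or the iteration limit is reached, matching the roles the two command-line arguments are described as playing.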

Testing 

For closed set testing use the program classify.c as follows:

cat <parameter file> | classify <codebook file> <lpc order> \
<codebook size> <number of speakers> > <outputfile>
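A closed-set decision with VQ codebooks is commonly made by choosing the speaker whose codebook gives the lowest average distortion on the test vectors. The Python sketch below illustrates that rule for a single feature type; it is an assumption about what classify computes, and the function names are hypothetical.

```python
def average_distortion(vectors, codebook):
    """Mean squared error of the vectors quantized to their nearest codeword."""
    total = 0.0
    for vec in vectors:
        total += min(sum((x - c) ** 2 for x, c in zip(vec, cw))
                     for cw in codebook)
    return total / len(vectors)


def classify(vectors, codebooks):
    """Return (index of the best-matching speaker, per-speaker distortions)."""
    scores = [average_distortion(vectors, cb) for cb in codebooks]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In a closed-set test the speaker with the smallest score is always reported; an open-set test would add a rejection step on that score.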

C  Shell  Scripts 

To process many files in batch mode we used several C shell scripts in both
the KING and GREENFLAG experiments. In the top level KING directory we used
make_all_params.csh, go_train.csh, and go_test.csh. These C shell programs
iterate through the appropriate training and testing files and compute all
parameters, train all codebooks, and finally process all test files.

Likewise, in the top level GREENFLAG directory, make_train_params.csh,
make_test_params.csh, go_train.csh, and go_test.csh perform the functions
indicated by their names. The go_test.csh program performs a closed-set test.
There are several scripts that compute open-set tests using the different out-of-set
criteria discussed in this report. The names of these scripts and the out-of-set
metrics they use follow:

a) jack_knife.csh: N Closest Speakers' Scores Thresholds

b) jack_knife.thresh.cnaps.csh: Global Thresholds

c) jack_knife.cohortcnaps.csh: Cohort Normalized Thresholds

Scripts b) and c) call programs compiled on the CNAPS parallel processing
server. The script jack_knife.csh calls the Sun workstation C programs we
described in the previous sections of this Appendix. Any of the C shell scripts
can be modified to process other databases in different directories.

MISSION OF ROME LABORATORY


Mission.  The  mission  of  Rome  Laboratory  is  to  advance  the  science  and 
technologies  of  command,  control,  communications  and  intelligence  and  to 
transition  them  into  systems  to  meet  customer  needs.  To  achieve  this, 
Rome  Lab: 


a.  Conducts  vigorous  research,  development  and  test  programs  in  all 
applicable  technologies; 

b.  Transitions  technology  to  current  and  future  systems  to  improve 
operational  capability,  readiness,  and  supportability; 

c.  Provides  a  full  range  of  technical  support  to  Air  Force  Materiel 
Command  product  centers  and  other  Air  Force  organizations; 

d.  Promotes  transfer  of  technology  to  the  private  sector; 

e.  Maintains  leading  edge  technological  expertise  in  the  areas  of 
surveillance,  communications,  command  and  control,  intelligence,  reliability 
science,  electro-magnetic  technology,  photonics,  signal  processing,  and 
computational  science. 


The thrust areas of technical competence include: Surveillance,
Communications, Command and Control, Intelligence, Signal Processing,
Computer Science and Technology, Electromagnetic Technology,
Photonics and Reliability Sciences.


